In Copyright - Non-Commercial Use Permitted Rights / License: … · 2020. 2. 15. · Hussein Hassan Harrirou Sunday 31st March, 2019 Supervisor: Dr. Thomas Lemmin, Prof. Ce Zhang

Research Collection

Master Thesis

Neural networks for improving drug discovery efficiency

Author(s): Hassan Harrirou, Hussein

Publication Date: 2019-03

Permanent Link: https://doi.org/10.3929/ethz-b-000337776

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

http://rightsstatements.org/page/InC-NC/1.0/

https://www.research-collection.ethz.ch

https://www.research-collection.ethz.ch/terms-of-use

Neural networks for improving drug discovery efficiency

Master Thesis

Hussein Hassan Harrirou

Sunday 31st March, 2019

Supervisor: Dr. Thomas Lemmin, Prof. Ce Zhang

Department of Computer Science, ETH Zurich

Acknowledgements

I would like to thank everyone who helped and supported me through the last semester of my Master’s degree.

I want to express my utmost gratitude to my parents, Ala and Najat, without whom I could never have gotten the possibility to do my Master’s degree abroad and join ETH Zurich, and who gave me the unconditional love and support that I needed while I studied abroad.

I want to show my gratitude to my supervisor Thomas Lemmin, for showing great patience and constant support by filling my gaps in knowledge about biochemistry and how to work systematically on my experiments. Many thanks to Prof. Ce Zhang for his ideas and help figuring out my mistakes during the development of the neural networks, and for helping me realize that I do not know as much about machine learning as I thought I did. Thanks to all the members of the department who helped me by sharing their computational resources; without your help I couldn’t have made it to the deadline.

I would like to thank my friends, who listened to my incessant complaints and helped me stay sane for the last six months. Special thanks: to David Martínez Rubio, Jeniffer Zhou and Nikolai Nikolov for their ideas on how to improve and fix the neural network architectures; to Asuman Inan, Xinyue Yao, and Eric García de Ceca who helped proofreading and correcting this work, and supported me throughout the thesis; to Asuman Inan, Eric García de Ceca, Xinyue Yao, David Martínez Rubio, Jay Li, Kenichi Furuya, Rikushi Yasumatsu, Georgiana Birjovanu, Xi Chen, Renjie Cui, Yaqi Zhao, Qiao Tang, Risako Miyake and Benedikt Petko for their incredible support during the hardships of this last half year.


Abstract

The increase in costs and needs of new drugs has led to the wide use of computational methods for drug discovery. Among the different methods, machine-learning based methods have been shown to offer a good efficiency/accuracy trade-off. Here we explore how to improve the prediction of binding affinity, a measurement of drug binding effectiveness. To do so, we base our development on the previous state-of-the-art Convolutional Neural Network approach, KDeep, and research how to improve the neural network architecture and how to use different architectures, such as ResNet. We also explore the performance of more physically meaningful energy-based features, derived from the Rosetta force field, APBS electrostatics, and electronegativity maps. We examine how these features can be represented in 3D space with different types of filters and interpolations, and compare their performance. Finally, we perform data manipulation and augmentation to improve the generalization capabilities of the neural networks by applying local relaxation of the protein-ligand structures and increasing our dataset tenfold with molecular dynamics simulations. We have shown that energy-based features are able to represent the information needed for good pK binding affinity prediction, achieving 1.36 RMSE on the PDBBind core set (v.2016) for Rosetta and electronegativity features. When combined with HTMD features, we lowered the RMSE down to 1.32. In addition, we achieved better predictions for the test datasets of CSAR HiQ and CSAR 2014 and overall better generalization, with very similar Pearson’s correlation coefficients for the validation and test datasets. Finally, we have produced an extensive dataset of more than 48 000 poses from the 4463 complexes offered in PDBBind v.2018, which can be used for further studies.

Keywords: Binding affinity, dissociation constant (Kd), inhibition constant (Ki), Convolutional Neural Networks, Rosetta, ResNet, Molecular Dynamics, PDBBind, Drug Discovery


Contents

1 Introduction
    1.1 Motivation
    1.2 Background Knowledge
        1.2.1 Binding affinity
        1.2.2 Convolutional Neural Network architectures
    1.3 Previous results
    1.4 Objective

2 Materials
    2.1 Datasets

3 Methods
    3.1 Features
        3.1.1 Molecular descriptors
        3.1.2 Electrostatic charges
        3.1.3 Force-field energies
        3.1.4 Atom identification
    3.2 3D representation
    3.3 Deep Learning
    3.4 Pipeline
        3.4.1 Getting the input data
        3.4.2 Relaxing the ligands into the protein
        3.4.3 Autodock Vina formatted files
        3.4.4 Preprocessing Rosetta pointwise energies
        3.4.5 Computing features
        3.4.6 Merging the features
        3.4.7 TFRecords for neural network input
        3.4.8 Augmenting the data - Molecular Dynamics simulation

4 Results
    4.1 Prediction results
        4.1.1 Original KDeep - HTMD features
        4.1.2 Larger filter KDeep
        4.1.3 ResNet-101
        4.1.4 Comparison of results with other networks
    4.2 Representation results
    4.3 Data augmentation results

5 Discussion

6 Conclusions

Bibliography

Appendices

A Dataset complexes
    A.1 Training data: PDBBind 2018 Refined set
    A.2 Validation data: PDBBind 2018 Core set
    A.3 Test set: CSAR HiQ
        A.3.1 Set 1
        A.3.2 Set 2
    A.4 Test set: CSAR 2014

B Core set clusters

C Pipeline scripts
    C.1 Rosetta relaxation
    C.2 Precompute Rosetta energies
    C.3 CNN Architectures
    C.4 Feature generation
    C.5 PDBQT generation
    C.6 Molecular Dynamics simulation

Chapter 1

Introduction

1.1 Motivation

In 1928, the antibiotic penicillin was discovered by Alexander Fleming[1]. Shortly after the spread and mass production of the drug, the first case of penicillin-resistant Staphylococcus aureus was found[8]. Since then, the use of antibiotics has skyrocketed, as have the cases of multi-drug resistant bacteria[27]. This continued trend calls for the constant development of new drugs effective against previously treatable bacteria.

Drug development by pharmaceutical companies is one of the costliest industries. The investment required to develop a new drug was estimated to be $1.4 billion on R&D alone[12].

The growing necessity for new drugs and the increasing costs of their research and development motivate the use of mathematical and computational methods that could narrow down the search space. The drug search space is very large, with an estimate of more than 10^6[14] possible protein-ligand complexes. The Protein Data Bank (PDB) is a database of 3D structures of proteins, nucleic acids, and other biological molecules, together with their associated information. It already holds more than 144,000 entries, and in recent years around 10,000 more have been added every year[45]. It is clear then that neither experimental testing nor time-intensive computational methods are reasonable approaches to identify well-performing combinations of proteins and drugs.

Computer-aided drug discovery has been approached as a set of sub-problems. Virtual Screening (VS) consists of searching large datasets of small molecules for potential ligands that would properly bind to the target protein. One of the search methods is molecular docking, in which a ligand is spatially manipulated in order to find the binding conformation to the protein with the lowest energy[28]. Binding Affinity Prediction can then be used to compare and rank the binding strength between the target protein and potential drugs by using several constants derived from molecular mechanics.

Binding affinity can be predicted with a few different methods. The more precise techniques, relying on Quantum Mechanics[16] or Thermodynamic integration[32], are computationally expensive, which limits their large-scale usability. Machine-learning approaches offer a faster alternative; however, their accuracy strongly depends on the adequate selection of features and models.

In this project, we investigated the use of Deep Learning to improve the binding affinity prediction of protein-ligand complexes. The current state-of-the-art system, KDeep[24], uses a group of features classified as molecular descriptors, which are indicator functions of certain subsets of atoms with certain properties. We explore the performance of energy-based features such as the ones from all-atom energy functions like Rosetta[4] or electrostatic energies based on the Poisson-Boltzmann equations[17]. We also compare different filters and interpolations to represent these features in a 3D grid. We explore the performance of two different convolutional neural network architectures: KDeep (a variation of SqueezeNet[23]) and ResNet[21]. Finally, we explore the change of performance when augmenting our dataset by introducing changes on the proteins’ structures via molecular dynamics simulations.

1.2 Background Knowledge

In this section we explain two concepts required to understand the development of the project. First, we glance through the definition of binding affinity and its chemical significance. Secondly, we explain the building blocks of the neural network architectures used later in this work, KDeep[24] (with its underlying SqueezeNet[23] architecture) and ResNet[21].

1.2.1 Binding affinity

Binding affinity represents how well a protein and a ligand bind when combined in a solution. This binding can be represented by the following reactions, where E is the protein, S is the ligand and I is a potential inhibitor.

$$E + S \underset{k_r}{\overset{k_f}{\rightleftharpoons}} E \cdot S$$

$$E + I \underset{k_{ri}}{\overset{k_{fi}}{\rightleftharpoons}} E \cdot I$$

Binding affinity can be understood as the difference of energy between the unbound and bound complex. This is expressed by the Gibbs free energy at standard conditions, $\Delta G^\circ$.

Binding affinity is also commonly measured by other values related to $\Delta G^\circ$, such as the dissociation constant ($K_D$), the inhibition constant ($K_I$) or the half maximal inhibitory concentration ($IC_{50}$).

$$K_D = \frac{[E][S]}{[ES]} = \exp\left(\frac{\Delta G^\circ}{RT}\right)$$

$$K_I = \frac{[E][I]}{[EI]}$$

$$IC_{50} = K_I\left(1 + \frac{[S]}{K_M}\right) + \frac{[E]_0}{2}$$

In many binding affinity databases only one of them appears per complex, and regardless of their differences, many prediction approaches will consider them as a generic binding affinity constant.
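As a worked illustration of how these quantities relate, the following minimal sketch converts a dissociation constant into the logarithmic pK value used as a prediction target and into a standard free energy through the relation $K_D = \exp(\Delta G^\circ / RT)$ given above. The temperature of 298.15 K, the function names and the example value of 1 nM are assumptions chosen for illustration only.

```python
import math

R = 8.314462618  # molar gas constant in J/(mol*K)


def pk_from_constant(k_molar):
    """Return pK = -log10(K) for a binding constant given in mol/L."""
    return -math.log10(k_molar)


def delta_g_from_kd(kd_molar, temperature=298.15):
    """Return the standard free energy in kJ/mol from K_D = exp(dG / (R*T)).

    The default temperature (298.15 K) is an assumption of this sketch.
    """
    return R * temperature * math.log(kd_molar) / 1000.0


kd = 1e-9  # hypothetical 1 nM binder
print(f"pK = {pk_from_constant(kd):.2f}")
print(f"dG = {delta_g_from_kd(kd):.2f} kJ/mol")
```

A stronger binder has a smaller $K_D$, hence a larger pK and a more negative $\Delta G^\circ$.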

1.2.2 Convolutional Neural Network architectures

In this project we employ two Convolutional Neural Networks (CNN), SqueezeNet and ResNet, to predict the binding affinity.

SqueezeNet was designed to be a compact version of the AlexNet architecture[30]. Both AlexNet and ResNet were originally developed to classify natural images from the ImageNet[11] dataset, but have been widely used in many classification and regression problems in many fields, including problems with 3D images such as 3D pose prediction[36] and MRI detection[29].

Both of these architectures are deep convolutional neural networks designed by sequentially composing different modules. SqueezeNet defines two-layered modules called “fire modules”. The first layer of the module is the squeeze layer, a convolutional layer with filter size 1x1 and a small number of filters. This squeeze layer is meant to constrain the network to a more compact representation in order to find the features that are most important and reject the noisier ones. The second layer is the expand layer, composed of two side-by-side convolutional layers, one of shape 1x1 and the other of shape 3x3, with a larger channel count than the squeeze layer. This layer is dedicated to performing transformations over the “squeezed” representation. The outputs of these side-by-side convolutions are then concatenated to form an output with a much larger channel count. Each fire module maintains the dimensions of the images, and either maintains or increases the number of channels. The overall network is composed of groups of fire modules, with max pooling layers between groups to reduce the image dimensionality after each block. Finally, the output of the last block is averaged and passed through a fully connected layer. A graphic representation of the fire module is shown in Figure 1.1.
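The channel bookkeeping of a fire module can be made concrete with a small calculation. The sketch below is not the thesis implementation: it only counts output channels and weights for a 3D fire module (filters of shape 1x1x1 and 3x3x3), assuming one bias per filter. The function name and the example sizes are illustrative.

```python
def fire_module_3d(in_channels, squeeze, expand):
    """Return (output_channels, parameter_count) for a 3D fire module.

    squeeze: number of filters in the 1x1x1 squeeze convolution.
    expand:  number of filters in EACH expand branch (1x1x1 and 3x3x3).
    One bias per filter is counted, which is an assumption of this sketch.
    """
    squeeze_params = in_channels * squeeze * 1 + squeeze
    expand1_params = squeeze * expand * 1 + expand
    expand3_params = squeeze * expand * 27 + expand  # 3 * 3 * 3 = 27 weights per kernel
    out_channels = 2 * expand  # the two expand branches are concatenated
    return out_channels, squeeze_params + expand1_params + expand3_params


# A module fed with 96 channels, squeezing to 16 and expanding to 64 + 64.
channels, params = fire_module_3d(in_channels=96, squeeze=16, expand=64)
print(channels, params)
```

Because the 3x3x3 branch only sees the squeezed representation, most of the cost scales with the small squeeze width rather than with the input channel count, which is what keeps the architecture compact.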

Figure 1.1: SqueezeNet’s “Fire module”. Squeeze layer: 1-width convolution. Expand layer: 1-width and 3-width convolutions. The outputs of the expand convolutions are concatenated.

ResNet follows a similar idea with its “residual modules”. In ResNet’s case, the module has three layers, with filter shapes 1x1, 3x3 and 1x1 respectively. Residual modules also include skip connections, following the idea that the modules shouldn’t learn all attributes of the data, but only the perturbations between input and output. Residual modules, in the same way as fire modules, do not reduce the size of the images, and ResNet composes them in blocks separated by max pooling layers. There are also skip connections between contiguous blocks, skipping the separating max pooling layers. A graphic representation of a residual module is shown in Figure 1.2.

Figure 1.2: ResNet’s “Residual module”. A 1-width convolution, a 3-width convolution and a 1-width convolution, followed by a shortcut connection that adds the input to the output.
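The idea that a residual module only learns a perturbation of its input can be written as y = x + F(x). The following toy sketch uses plain Python lists in place of feature maps and arbitrary callables in place of the three convolutions; all names are illustrative.

```python
def residual_module(x, transform):
    """Apply y = x + F(x), where F stands in for the stacked convolutions."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must keep the shape of x for the shortcut addition")
    return [xi + fi for xi, fi in zip(x, fx)]


features = [1.0, -2.0, 3.0]

# If F outputs zeros, the module reduces to the identity mapping.
unchanged = residual_module(features, lambda v: [0.0 for _ in v])

# Otherwise F only has to model the difference between input and output.
perturbed = residual_module(features, lambda v: [0.1 * vi for vi in v])
print(unchanged, perturbed)
```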

1.3 Previous results

Machine-learning approaches to binding affinity prediction have become very popular in recent years. Although their accuracy is still not comparable to that of costlier methods, such as the ones based on simulations[18], the size of the drug search space makes speed outweigh precision as a bottleneck.

A wide variety of classical machine-learning algorithms have been applied, from linear regression[31, 46], kernel ridge regression[44, 40], support vector machines[41, 46] and Gaussian processes[31] to random forests[41, 34, 44]. In this regard, model choice seems to favor random forests, with RF-Score[34] as the best performing example.

Until recently, the best performing machine-learning approach was the random forest RF-Score, which relies on 42 molecular descriptors extracted from Autodock Vina[43]. RF-Score reported an RMSE of 1.513 for $pK_D$ prediction on its test set (PDBBind 2007 core set), whereas [24] reports an RMSE of 1.39 for it after training with the PDBBind[35] 2016 refined set minus the core set, and testing with the core set.

On the other hand, neural networks, and specifically deep learning, have recently appeared in the field and shown promising results that improve on RF-Score’s. Examples include KDeep[24], TopologyNet[6] and DLScore[20].

KDeep, the network we base our project on, focuses on generating 3D maps of different molecular descriptors of the atoms of both protein and ligand, in a bounding box of 24 Angstrom centered at the ligand’s center of mass. The resulting 16 maps are then the input for a simplified SqueezeNet. KDeep reports an RMSE of 1.27 on the PDBBind 2016 core set for $pK_D$ prediction.

TopologyNet relies on computing topologically invariant features (i.e. Betti numbers) of the graph of heavy atoms and bonds at different bonding distances and computing so-called topological barcodes, which are fed into a 1D convolutional neural network together with additional geometrical data. TopologyNet reported an RMSE of 1.37 for $pK_D$ prediction when training with the PDBBind 2016 refined set minus the core set and testing with the core set.

DLScore counts the closeness between atoms from the protein and the ligand in buckets depending on the atom types. It also calculates the electrostatic energy of the residues. Finally, all these features are added to the Vina features and fed into a deep neural network to predict the result. DLScore reports an RMSE of 1.15 for $\Delta G^\circ$ prediction, when trained with the PDBBind 2016 refined set minus the core set and tested with the core set.

1.4 Objective

In this project, we tried to answer the following questions:

1. How can we incorporate physically meaningful features like electrostatics from APBS[25] and force-field energies from Rosetta[4]?

2. How should these features be represented in 3D, and how do new feature representations perform compared to the new proposals?

3. Can we find a neural network architecture that performs better than KDeep?

4. Does data augmentation via molecular dynamics simulations help the neural network to learn to predict better?

Chapter 2

Materials

2.1 Datasets

In this section we introduce the datasets used to train, validate and test the machine learning models.

For training and validation, the PDBBind dataset[35] is used. More concretely, we use the “refined set” subset of the dataset version released in 2018. The “PDBBind 2018 general set” contains 19588 biomolecular structures for which some binding affinity data (i.e. Kd, Ki, IC50) has been experimentally recorded. The “refined set” contains 4463 complexes selected after applying certain curation filters. Finally, a 290-complex subset of the “refined set” called the “core set” is used for validation. The “core set” was constructed by clustering the complexes of the “refined set” into 58 clusters with a 90% similarity cutoff on the protein sequence and taking 5 examples from each cluster, in order to evenly represent the diversity of the complexes. In our case, due to changes in PDBBind v.2018, only 271 complexes from the “core set” are currently classified as part of the “refined set”.

For testing, we used the 2010 CSAR-HiQ[15] and 2014 CSAR[7] datasets. These datasets are filtered to contain only the complexes not already considered during training and validation. The former is given as two separate sets, CSAR-HiQ Set 1 and CSAR-HiQ Set 2. After filtering, CSAR-HiQ Set 1 contains 55 complexes, CSAR-HiQ Set 2 contains 49 complexes, and CSAR 2014 contains 47 complexes.

In the end, the training dataset consists of 4192 complexes, the validation dataset of 271 complexes, and the test datasets of 201 complexes. A list of these complexes appears in Appendix A.

It is important to note that this division between training, validation and testing is common but not unique, and other authors have tested other approaches[44], such as a time-based split (in which the validation and testing subsets are formed by the newly acquired experimental results in order to simulate the real discovery procedure), a structure-based split (which separates the dataset according to substructure similarity) or stratified random sampling (a split that ensures that the full range of values is represented in each of the subsets).
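To make the last alternative concrete, the following sketch shows one possible stratified random split over binding affinity values. It is not the procedure used in this work or in [44]. The number of bins, the held-out fraction and the function name are assumptions chosen for illustration.

```python
import random


def stratified_split(pk_values, n_bins=5, holdout_fraction=0.2, seed=0):
    """Split sample indices so that every pK bin contributes to the held-out subset."""
    rng = random.Random(seed)
    lo, hi = min(pk_values), max(pk_values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    bins = {}
    for idx, value in enumerate(pk_values):
        b = min(int((value - lo) / width), n_bins - 1)
        bins.setdefault(b, []).append(idx)

    train, holdout = [], []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        # Take at least one sample from every populated bin.
        n_hold = max(1, round(len(members) * holdout_fraction))
        holdout.extend(members[:n_hold])
        train.extend(members[n_hold:])
    return sorted(train), sorted(holdout)


# Hypothetical affinities spread evenly between pK 2 and pK 11.9.
values = [2 + 0.1 * i for i in range(100)]
train_idx, holdout_idx = stratified_split(values)
print(len(train_idx), len(holdout_idx))
```

A time-based split would instead sort the complexes by release date and hold out the most recent ones, which requires metadata not modeled here.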


Chapter 3

Methods

To improve the predictive behavior of the currently applied methods, the following approaches were considered:

• Identifying features that describe well the concepts related to binding affinity.

• Finding a 3D representation for those features, to be used with convolutional neural networks.

• Modifying the neural network architecture to improve generalization.

3.1 Features

In the field of biochemistry, proteins and ligands are commonly given in a few raw representations, such as single-line strings indicating atoms, molecular motifs and bonds (e.g. SMILES or FASTA), or full 3D representations of atom coordinates (e.g. PDB), optionally with bonds (e.g. MOL2).

These representations fail to satisfy one of the preferable properties of inputs to most machine-learning algorithms: a fixed-size structure. Furthermore, it is unclear whether directly using these raw features is enough to represent any attribute interesting for our purpose of predicting binding affinity.

Thus, a great deal of research has focused on obtaining advanced representations developed from the previously mentioned raw ones. Overall, three types of attributes seem to dominate the representation landscape: topological features, based on the atoms and their bonds; energetic features, based on the charges and energies of the atoms and their bonds; and statistical features of both the raw data and the previous features.

Examples include statistical features of atom types and small substructures[31, 46], physicochemical markers[31, 41, 46], and energy features such as Coulomb matrices[40] and Gasteiger charges[24].

In this project, we have focused on three different feature sets:

• Molecular descriptors

• Electrostatic energies

• Force-field energies

3.1.1 Molecular descriptors

Molecular descriptors indicate, as binary maps, certain properties of the atoms in a molecule or protein.

Following KDeep’s[24] procedure, we used the molecular descriptors offered by the Python library HTMD[13], which are defined in terms of the atom types of the software Autodock Vina 4[43]. In it, eight different descriptors are defined, as shown in Table 3.1.

Table 3.1: HTMD feature maps, table extracted from [24]

| Channels | Atoms |
|---|---|
| Hydrophobic | Aliphatic or aromatic carbons |
| Aromatic | Aromatic carbons |
| Hydrogen bond acceptor | Nitrogen, oxygen and sulphur acceptors |
| Hydrogen bond donor | Donor hydrogens bonded to oxygen or nitrogen |
| Positive ionizable | Atoms with positive Gasteiger charge |
| Negative ionizable | Atoms with negative Gasteiger charge |
| Metallic | Magnesium, Zinc, Manganese or Iron |
| Occupancy | All atoms |

The intuition behind these descriptors is that a machine-learning algorithm should be able to learn the relation between different atom types that interact creating regions of high or low energy, making the bonds between protein and ligand stronger or weaker.

These descriptors, as defined before, only represent the type of single atoms, and do not extend intuitively to a 3D model. HTMD does implement a voxelization method for them, which is discussed in section 3.2.

These molecular descriptors are computed for both protein and ligand, resulting in 16 feature maps per complex. Two examples of these feature maps are shown in Figure 3.1 and Figure 3.2.

Figure 3.1: HTMD Aromatic channel for protein in complex 10GS

Figure 3.2: HTMD Positively charged channels for protein and ligand in complex 10GS

3.1.2 Electrostatic charges

The electrostatic interactions of the complex should give some information on the interaction between the protein and ligand. The Poisson-Boltzmann equation[17], describing the charge distribution between a solution and a charged surface, is used to compute these electrostatics. Figure 3.3 shows the equation.

$$\nabla^2 \varphi = c_0 e \cdot \left[\exp\left(-\frac{e\varphi}{k_B T}\right) - \exp\left(\frac{e\varphi}{k_B T}\right)\right]$$

Figure 3.3: Poisson-Boltzmann Equation

We use the program Adaptive Poisson-Boltzmann Solver (APBS)[25], designed specifically for biomolecular settings.

With APBS, we compute three electrostatic energy maps: one for the energy distribution of the protein alone, a second for the distribution of the ligand alone, and a third for the distribution when both protein and ligand coexist. This combination of maps should carry some information about the difference in energy before and after binding, which is related to our target value, the binding affinity of the complex.

These maps are given as 3D voxelized grids by APBS and need no further treatment.

3.1.3 Force-field energies

Force-fields are functions approximating the potential energy of an atomic system in terms of interactions between the different atoms.

Much work has previously been done on force-fields to predict binding affinity and docking of complexes. Specifically for binding affinity, most approaches approximate the change of Gibbs free energy and use that to compute the binding affinity constant. These approaches consider only system-wide (statistical) properties derived from the force-fields. The novel contribution of this project to the approaches already explored by other authors is the use of force-fields as derived 3D features of the complexes. In the 3D setup, many takes on the problem of predicting binding affinity have implicitly used force-fields in the ligand docking step, in which the position of the ligand with respect to the protein is optimized by minimizing the energy of a given force-field.

We instead use a force-field to generate maps between atoms and the different energies and attributes that the force-field offers, which will later be voxelized accordingly.

We focus on the well-performing Rosetta all-atom force field. In particular, we explore the attractive, repulsive, electrostatic and solvative energies between pairs of non-bonded atoms that the Rosetta force-field offers. These features are computed using the Rosetta framework[33] and PyRosetta.

Figure 3.4: Plots of a single contribution of the different energies obtained from Rosetta, for the interaction pair C-N. Panels: (a) attractive energy, (b) repulsive energy, (c) electrostatic energy, (d) solvative energy.

An example of the distribution of the energy with respect to the distance between two atoms is shown in Figure 3.4.

Emulating the separation of positive and negative charges in HTMD, we preprocess these Rosetta features to generate 6 maps from the 4 force maps: attractive, repulsive, positive electrostatic, negative electrostatic, positive solvative and negative solvative. We generate these 6 maps for both the protein alone and the ligand alone, summing up to 12 maps in total. These maps then have a representation applied (discussed in section 3.2) and are normalized to the range 0 to 1 to match the range of the HTMD features.
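The sign separation and the rescaling can be sketched as follows. This is a simplified illustration rather than the pipeline code: it operates on a short list of hypothetical per-atom energies instead of voxelized maps, it stores the negative part as magnitudes, and it uses min-max scaling. All three choices are assumptions of the sketch.

```python
def split_by_sign(values):
    """Split signed energies into a positive part and a negative part (as magnitudes)."""
    positive = [max(v, 0.0) for v in values]
    negative = [max(-v, 0.0) for v in values]
    return positive, negative


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


electrostatic = [-1.5, 0.0, 0.5, -0.5]  # hypothetical per-atom energies
positive, negative = split_by_sign(electrostatic)
positive_map = min_max_normalize(positive)
negative_map = min_max_normalize(negative)
print(positive_map, negative_map)
```

Applying the same split to the electrostatic and solvative energies, and keeping the attractive and repulsive energies as they are, yields the 6 maps per molecule described above.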

Other force-fields are implicitly used in other parts of our pipeline. CHARMM[5] is used in the data augmentation via molecular dynamics simulations. AMOEBA[42] and AMBER[9] were also explored as possible alternatives to Rosetta.

Figure 3.5: Electronegativity channels for protein and ligand in complex 10GS

3.1.4 Atom identification

One approach derived from the idea of molecular descriptors is to have channels indicating the elements of the atoms. Due to the high number of elements that take part in our complexes, it is unfeasible to introduce 15+ more channels just for this. Our take on this is to create one map for the protein and one for the ligand, in which we distribute spatially the electronegativity of each atom. These values should bear enough information for the neural network to identify the atoms.

To get these electronegativity values we use the Python library Mendeleev[3]. The values are then distributed following the same procedure applied to the force-field features.

As with the rest of the maps, the electronegativity maps are also normalized to the range 0 to 1. An example of the electronegativity maps is shown in Figure 3.5.

3.2 3D representation

With the advent of Deep Learning and the success of Convolutional Neural Networks in computer vision, applications of CNNs to machine-learning based biochemistry have grown as well. The inherent 3D structure of proteins and ligands intuitively leads to the use of CNNs, which capture local properties of a voxelized complex. In order to use a CNN, we need to represent the data in a fixed-size 3D shape. We apply a common approach of creating a cubic evenly-spaced grid with a side length of 24 Angstrom, centered around the center of mass of the ligand.

Table 3.2: Filter equations applied to pointwise features

| Filter | Formula |
|---|---|
| Gaussian | $\exp(-(r / r_{vdw})^2)$ |
| Inverse exponential | $1 - \exp(-(r_{vdw} / r)^{12})$ |

Table 3.3: Interpolations applied to pointwise features

| Interpolation RBF | Formula |
|---|---|
| Linear | $r$ |
| Gaussian | $\exp(-(r / \varepsilon)^2)$ |
| Thin plate | $r^2 \log(r)$ |

Both the Autodock Vina molecular descriptors obtained through HTMD and the APBS electrostatic fields are already represented in 3D voxels.

For HTMD, voxelization is done by applying a filter for each binary map around each active atom. The filter has the following expression:

$$f(r) = 1 - \exp\left(-\left(\frac{r_{vdw}}{r}\right)^{12}\right)$$

where $r$ is the distance from the atom center to the voxel center and $r_{vdw}$ is the Van der Waals radius of the atom.
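A minimal pure-Python sketch of this kind of voxelization is given below. It is not HTMD's implementation. The 1 Angstrom voxel spacing, the use of voxel centers, the combination of overlapping atoms with a maximum, and all function names are assumptions of the sketch.

```python
import math


def inverse_exponential(r, r_vdw):
    """f(r) = 1 - exp(-(r_vdw / r)^12); the value tends to 1 at the atom center."""
    if r < 1e-6:  # avoid division by zero and overflow very close to the atom
        return 1.0
    return 1.0 - math.exp(-((r_vdw / r) ** 12))


def voxelize(atoms, center, box=24.0, spacing=1.0):
    """Return {(i, j, k): value} on a cubic grid of side `box` centered at `center`.

    atoms: list of ((x, y, z), r_vdw) for the atoms active in one channel.
    Overlapping contributions are combined with max (an assumption of this sketch).
    """
    n = int(box / spacing)
    origin = [c - box / 2.0 + spacing / 2.0 for c in center]
    grid = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                voxel = (origin[0] + i * spacing,
                         origin[1] + j * spacing,
                         origin[2] + k * spacing)
                value = 0.0
                for position, r_vdw in atoms:
                    r = math.dist(voxel, position)
                    value = max(value, inverse_exponential(r, r_vdw))
                grid[(i, j, k)] = value
    return grid


# One hypothetical carbon atom (Van der Waals radius 1.7 Angstrom) at the box center.
grid = voxelize([((0.0, 0.0, 0.0), 1.7)], center=(0.0, 0.0, 0.0))
print(len(grid), grid[(12, 12, 12)], grid[(0, 0, 0)])
```

The twelfth power makes the filter behave almost like a step function: voxels inside the Van der Waals radius receive values close to 1, and the value drops quickly to 0 outside of it.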

Part of the project involves finding a good way to voxelize the pointwise features, that is, the Rosetta energies and the electronegativity values. Two approaches were tested: applying a filter like the previous one, and applying interpolation algorithms.

In Table 3.2 we list the filters that were tried to distribute the pointwise values in the 3D grids. Likewise, the tested SciPy[2] interpolations are listed in Table 3.3.

In Figure 3.6 the spatial distribution of the filters is shown. In Figure 3.7 examples of the respective distributions are shown.
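To illustrate how a radial basis function (RBF) interpolation spreads pointwise values over space, the sketch below fits one weight per data point so that the weighted sum of kernels reproduces the given values at the data points. It is a from-scratch illustration using the three kernels of Table 3.3, not the SciPy routine used in this work. The small linear solver, the default epsilon of 1 and the example coordinates are assumptions.

```python
import math


def linear_kernel(r):
    return r


def gaussian_kernel(r, epsilon=1.0):
    return math.exp(-((r / epsilon) ** 2))


def thin_plate_kernel(r):
    return 0.0 if r == 0.0 else r * r * math.log(r)


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (a[row][n] - tail) / a[row][row]
    return x


def rbf_interpolator(points, values, kernel):
    """Fit weights w so that sum_j w_j * kernel(|p - p_j|) matches `values` at `points`."""
    matrix = [[kernel(math.dist(p, q)) for q in points] for p in points]
    weights = solve(matrix, values)

    def interpolate(p):
        return sum(w * kernel(math.dist(p, q)) for w, q in zip(weights, points))

    return interpolate


# Three hypothetical atoms carrying pointwise energies.
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
energies = [1.0, -0.5, 0.25]
field = rbf_interpolator(atoms, energies, gaussian_kernel)
print(field((0.0, 0.0, 0.0)), field((1.0, 1.0, 0.0)))
```

Unlike the filters, which place an independent bump around each atom, the interpolation solves for weights jointly, so the value at a voxel depends on all atoms at once.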

Figure 3.6: Plots of the density distribution of Gaussian and inverse exponential filters. Panels: (a) Gaussian filter, (b) inverse exponential filter.

Figure 3.7: Gaussian and inverse exponential filters applied. Panels: (a) Gaussian filter (repulsive map of 10GS’s protein), (b) inverse exponential filter (repulsive map of 10GS’s protein).

3.3 Deep Learning

Over the last few years, the focus on applying machine learning algorithms to the problem of drug discovery has increased considerably. Many models using classical methods such as linear regression[31, 46], kernel ridge regression[44, 40], support vector machines[41, 46], Gaussian processes[31] or random forests[41, 34, 44] have been used to predict the binding affinity of protein-ligand complexes. Until recently, the best performing machine-learning approach was the random forest RF-Score[34], which relies on 42 molecular descriptors extracted from Autodock Vina.

We built on top of KDeep’s[24] convolutional neural network, which achieved a lower validation error on PDBBind’s core set.

KDeep is based on SqueezeNet[23]. We have produced two versions of KDeep: the first one is the most faithful reproduction of the original KDeep network, matching the reported parameter count; the second is a modification of the former in which the size of the first convolution filter is increased from 1x1x1 to 7x7x7.

The first version, which we call “Original KDeep”, has the following structure:

Layer      Output size    Filter size  Stride  Squeeze  Expand
Input      25x25x25@29    -            -       -        -
conv1      12x12x12@96    1x1x1        2x2x2   -        -
fire1      12x12x12@128   -            -       16       64
fire2      12x12x12@128   -            -       16       64
fire3      12x12x12@258   -            -       32       128
maxpool1   6x6x6@258      3x3x3        2x2x2   -        -
fire4      6x6x6@258      -            -       32       128
fire5      6x6x6@384      -            -       48       192
fire6      6x6x6@384      -            -       48       192
fire7      6x6x6@512      -            -       64       256
avgpool1   3x3x3@512      3x3x3        2x2x2   -        -
flatten    13824          -            -       -        -
dense      1              -            -       -        -

The second version, or “Large-filter KDeep”, has the same structure as the previous one, except that the filter size of layer “conv1” is 7x7x7.

We have also used ResNet, following its original specification in [21]. We use a ResNet-101 with the following structure:


Layer         Output size    Filter size  Stride  Inner dimension  Copies
Input         25x25x25@29    -            -       -                -
conv1         12x12x12@64    7x7x7        2x2x2   -                -
maxpool1      6x6x6@64       3x3x3        2x2x2   -                -
inner-res-0   6x6x6@256      -            -       64               3
start-res-1   3x3x3@512      -            -       128              -
inner-res-1   3x3x3@512      -            -       128              3
start-res-2   2x2x2@1024     -            -       256              -
inner-res-2   2x2x2@1024     -            -       256              23
start-res-3   1x1x1@2048     -            -       512              -
inner-res-3   1x1x1@2048     -            -       512              2
flatten       2048           -            -       -                -
dense         1              -            -       -                -

The loss commonly used in binding affinity prediction is the RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$$

where $\hat{y}_i$ is the predicted binding affinity and $y_i$ is the true binding affinity.
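As a concrete reading of the RMSE definition, here is a minimal sketch in plain Python. The function name `rmse` and the list-based interface are choices made for this example, not the implementation of Appendix C.

```python
import math


def rmse(predicted, true):
    """Root mean squared error between predicted and true affinities."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - t) ** 2 for p, t in zip(predicted, true))
    return math.sqrt(sum(squared_errors) / len(true))
```

For instance, predictions of 2.0 and 4.0 against true values of 0.0 and 0.0 give sqrt((4 + 16) / 2) = sqrt(10).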

In our KDeep architectures, all convolutional layers are followed by a rectified linear unit (ReLU). Other activation functions were tried (ELU, SELU, Leaky ReLU), but no significant difference in performance was found. In our ResNet architecture, the first convolutional layer is followed by a ReLU, whereas the rest of the convolutional layers, used in the residual modules, are preceded by a ReLU, as recommended in [22]. Neither Batch Normalization (BN) nor Dropout was used, as in all of our experiments the performance was considerably worse when these types of regularization were applied.

For training, we use the Adam optimizer[26] with its default parameters and a learning rate of $10^{-4}$. We train the networks for 100 epochs using a batch size of 128. The variables are initialized with the default settings of TensorFlow, which correspond to the Glorot uniform initializer[19].

The implementation of these neural networks is available in Appendix C.

3.4 Pipeline

In order to make the project fully reproducible and easily extensible, a modular pipeline was developed. The large number of file formats and intermediate steps in the generation of our features made the system quite complex. In this section, we explain the steps taken from the raw data until it is fed to the CNN. All scripts mentioned in this section appear in Appendix C.

3.4.1 Getting the input data

Our initial dataset is the PDBBind refined set v.2018. It can be downloaded from [35] as a compressed .tar.gz file. Inside it, there is one folder per complex, e.g. 10gs. In 10gs there are several files, of which we will focus on 10gs_protein.pdb and 10gs_ligand.mol2. The .pdb file contains 3D atom positions, as well as atom type, residue name and protein chain. The .mol2 file contains atom positions with their types and the explicit bonds between them.

For the sake of simplicity, we will keep using 10gs as an example.

3.4.2 Relaxing the ligands into the protein

The PDBBind dataset was obtained by X-ray crystallography. This technique locates atoms with considerable electron density (and usually has trouble identifying hydrogen atoms[37]). Because of the interaction between the X-rays and the electrons of the complex, the exact positions may not be very realistic due to ionization effects and the like[10]. To amend this, we apply a “relaxation”[38] step using the Rosetta Commons[33] framework. This relaxation process minimizes the target energy (in our case, Rosetta’s energy function) by moving the atoms within 20 Angstrom of the ligand’s center of mass. This distance is selected so that relaxed atoms are available for any rotation of the cube of side length 24 Angstrom centered at the ligand’s center of mass. Water molecules located far away from the protein pocket (i.e. waters not within 3 Angstrom of either the protein or the ligand) are removed to accelerate the relaxation. In cases where ions are in contact with the ligand, i.e. ions at a distance of at most 2.5 Angstrom from heavy ligand atoms such as oxygen or nitrogen, we apply constraints on the relaxation to maintain this maximum distance, so that these important ions do not drift outside the focused area.
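The three geometric selections described above (atoms to relax, waters to discard, ions to constrain) can be sketched as follows. This is only an illustration under stated assumptions: coordinates are plain (x, y, z) tuples in Angstrom, cutoffs are treated as inclusive, and all function names are hypothetical. The actual selection is performed by the Rosetta-based scripts of Appendix C.

```python
import math

RELAX_RADIUS = 20.0  # Angstrom around the ligand's center of mass
WATER_CUTOFF = 3.0   # waters beyond this distance from protein and ligand are removed
ION_CUTOFF = 2.5     # ions within this distance of heavy ligand atoms are constrained


def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)) pairs."""
    total = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * pos[axis] for mass, pos in atoms) / total
                 for axis in range(3))


def within(position, others, cutoff):
    """True if position lies within cutoff of at least one point in others."""
    return any(math.dist(position, other) <= cutoff for other in others)


def atoms_to_relax(positions, ligand_com, radius=RELAX_RADIUS):
    """Indices of the atoms that are allowed to move during relaxation."""
    return [i for i, p in enumerate(positions)
            if math.dist(p, ligand_com) <= radius]


def waters_to_remove(waters, protein, ligand, cutoff=WATER_CUTOFF):
    """Indices of waters that are close to neither the protein nor the ligand."""
    return [i for i, w in enumerate(waters)
            if not within(w, protein, cutoff) and not within(w, ligand, cutoff)]


def ions_to_constrain(ions, ligand_heavy_atoms, cutoff=ION_CUTOFF):
    """Indices of ions in contact with heavy ligand atoms."""
    return [i for i, ion in enumerate(ions)
            if within(ion, ligand_heavy_atoms, cutoff)]
```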

To begin the relaxation, we first need to compute the .params and .pdb files of the ligand, which we obtain by executing make_ligand_pdb_params. This produces the files 10gs_ligand.params and 10gs_ligand.pdb.

After this, we need to combine the protein and ligand .pdb files with make_complex_pdb. Here, the protein is protonated with PropKa[39] via HTMD[13] to recover the unidentified hydrogens, and residue names are normalized to their standard versions (e.g. HIE to HIS, AR0 to ARG, etc.). This produces 10gs_complex.pdb.

With this, we can relax the complex by executing minimize_rosetta. This generates up to 10 randomly generated poses, in files of the form 10gs_complex_00XX.pdb, where XX ranges from 01 to 10. It also generates the file score.sc, where the metadata of the poses (such as the energy score) is stored. From these poses, we take the one with the lowest energy. We can set aside the other poses by executing hide_non_minimal_complexes, which moves them to an auxiliary folder, leaving only one file 10gs_complex_00XX.pdb in the 10gs folder. We then recover the .mol2 file with bonds for the new pose by executing make_ligand_mol2_renamed and make_ligand_mol2, which creates 10gs_ligand_00XX.mol2. We also recover the .pdb file for the protein by executing make_protein_pdb. After this step we should have three files: 10gs_complex_00XX.pdb, 10gs_protein_00XX.pdb and 10gs_ligand_00XX.mol2.
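The selection of the lowest-energy pose from score.sc can be sketched as below. The sketch assumes the usual Rosetta score-file layout, in which every data line starts with `SCORE:`, the first such line is the header, and the columns `total_score` and `description` hold the energy and the pose name. The function name is hypothetical and the script in Appendix C may proceed differently.

```python
def lowest_energy_pose(score_text, column="total_score"):
    """Return (pose_name, energy) of the pose with the lowest energy.

    score_text is the content of a Rosetta-style score file (assumed layout).
    """
    header = None
    best = None
    for line in score_text.splitlines():
        if not line.startswith("SCORE:"):
            continue  # skip e.g. the SEQUENCE: line
        fields = line.split()[1:]
        if header is None:
            header = fields  # first SCORE: line lists the column names
            continue
        row = dict(zip(header, fields))
        energy = float(row[column])
        if best is None or energy < best[1]:
            best = (row["description"], energy)
    if best is None:
        raise ValueError("no pose found in score file")
    return best
```

The returned pose name identifies the single 10gs_complex_00XX.pdb file that is kept; the remaining poses are the ones moved to the auxiliary folder.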

3.4.3 Autodock Vina formatted files

Until now, all of our files were essentially slight transformations of the original .pdb and .mol2 files. To compute HTMD features, we need to apply a final transformation. From our 10gs_protein_00XX.pdb and 10gs_ligand_00XX.mol2 we need to produce two files in PDBQT format, i.e. 10gs_protein_00XX.pdbqt and 10gs_ligand_00XX.pdbqt, respectively. These files contain Gasteiger charges and a finer atom classification, with special types for carbons in benzenes (aromatic hydrocarbons), hydrogen bond acceptor versions for oxygen and nitrogen, and so on.

To generate these .pdbqt files, we execute make_pdbqt.

We now have the files needed to compute the molecular descriptor features with HTMD.

3.4.4 Preprocessing Rosetta pointwise energies

In order to compute both the electrostatic features with APBS and the Rosetta features, we need to precompute the radii, charges and forces of the complexes’ atoms.

To do so, we run compute_rosetta_energy, which produces a file named 10gs_complex_00XX.attr.npz, a NumPy compressed file with the data necessary to generate dictionaries for each atom’s radius, charge and Rosetta forces.

3.4.5 Computing features

Molecular descriptors - HTMD

To compute the molecular descriptor maps, we use the script make_htmd_features. This takes the previously generated .pdbqt files and produces a file 10gs_complex_00XX.hdf5, which contains a grid with 16 maps. This grid can be accessed by the key ’grid’ in the HDF5 structure.


Electrostatics - APBS

To compute the electrostatic maps, we use the script make_apbs_features. This uses 10gs_complex_00XX.pdb and 10gs_complex_00XX.attr.npz, and generates a file 10gs_complex_00XX.hdf5. This file contains a grid with 3 maps: one for the electrostatics of the protein, one for the ligand and one for the whole complex. Again, this grid is stored with the key ’grid’ in the HDF5 file.

As an intermediate step, the script generates .pqr files for the protein, ligand and complex, which store atom coordinates with their charges. These files are the input for APBS. APBS writes the resulting grids in .dx format, which our script transforms to HDF5.

Force-field energies - Rosetta

To compute the Rosetta energy maps, we use the script make_rosetta_features, which requires both 10gs_complex_00XX.pdb and 10gs_complex_00XX.attr.npz. It produces 10gs_complex_00XX.hdf5 with 4 maps, one per energy, stored with the key ’grid’ in the HDF5 file.

Electronegativity - Mendeleev

To compute the electronegativity maps, we use the script make_electroneg_features, which requires only 10gs_complex_00XX.pdb. It produces 10gs_complex_00XX.hdf5 with 2 maps, one for protein atoms and one for ligand atoms, stored with the key ’grid’ in the HDF5 file.

3.4.6 Merging the features

In this step, all the HDF5 files are merged to generate a single HDF5 file per complex. This is achieved by executing the script make_supermap.

3.4.7 TFRecords for neural network input

In this last step, we take the HDF5 files, read the grids, find the target value for each complex, and create TFRecords of the recommended 100 MB size.

Before creating these TFRecords, we have the option to split the dataset into training and validation sets. We can either split randomly by giving the ratio of complexes that should be collected for training, or we can split by setting the validation set to the complexes of the PDBBind core set and leaving the rest for training.

To do this, we call the script make_tfrecords. To choose between random splitting and core set splitting, we use the options --split or --core, respectively.
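The two splitting strategies can be sketched as follows. This is a minimal illustration and not the code of make_tfrecords: the function names, the `seed` parameter and the use of complex identifiers as plain strings are assumptions made for the example.

```python
import random


def split_random(complex_ids, train_ratio, seed=0):
    """Randomly assign a fraction train_ratio of the complexes to training."""
    ids = sorted(complex_ids)          # fixed order so the seed is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_ratio * len(ids)))
    return ids[:n_train], ids[n_train:]


def split_core(complex_ids, core_ids):
    """Use the core-set complexes for validation and the rest for training."""
    core = set(core_ids)
    train = [c for c in complex_ids if c not in core]
    validation = [c for c in complex_ids if c in core]
    return train, validation
```

The core set split keeps the validation set fixed across experiments, which is what makes the validation errors comparable with previously published results on the PDBBind core set.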


3.4.8 Augmenting the data - Molecular Dynamics simulation

To increase our training data, we use simulations to let the protein change shape. For this, we use OpenMM, a molecular mechanics software package.

We run a simulation of 2 ns duration, with 200,000 steps of 1 fs, taking snapshots every 20,000 steps, or every 20 ps.

As input, we need a .pdb and a .psf file for the protein. We can use 10gs_protein.pdb and generate 10gs_protein.psf by running make_protein_psf.

OpenMM generates a .dcd file, which can be used to generate 10 protein .pdb files. Each of these .pdb files, together with the ligand .mol2 file, can be processed with the previous steps like any other complex. To run OpenMM, we execute molecular_dynamics.


Chapter 4

Results

4.1 Prediction results

In this section we show the different experiments and the results obtained. We begin with our reproduction of the KDeep[24] experiment. We then show the results obtained with the large-filter KDeep and ResNet-101 architectures when using the following combinations of feature maps:

• HTMD features

• Rosetta features

• Electronegativity features

• APBS features

• Combination of features:

– HTMD + Rosetta

– HTMD + electronegativity

– Rosetta + electronegativity

– HTMD + Rosetta + electronegativity

The representation used for Rosetta and electronegativity features in this section is the inverse exponential filter.

A summary of the results appears in Figure 4.1 for the validation errors, Figure 4.2 for the test errors on CSAR HiQ Set 1, Figure 4.3 for the test errors on CSAR HiQ Set 2 and Figure 4.4 for the test errors on CSAR 2014. In Figure 4.5 and Figure 4.6 the results for Pearson’s R correlation are shown for the validation set and the average over the test sets, respectively. We also show the Pearson’s R per cluster in Figure 4.7 and Figure 4.8 for the best large-filter KDeep and ResNet-101 models, respectively, where the clusters are obtained with a 90% similarity cutoff on the complexes’ proteins (see Appendix B for a list of the clusters and their complexes).

[Figure 4.1: Best validation error per model type. Bar chart of RMSE per feature combination for the ResNet, KDeep and Original KDeep models; baseline[24]: 1.27.]

In all result tables, ρ and R indicate Spearman’s and Pearson’s correlation coefficients, respectively.
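For reference, both coefficients can be computed as in the following minimal sketch: Pearson’s R measures linear correlation, while Spearman’s ρ is Pearson’s R applied to the ranks of the values. The function names and the tie handling by average ranks are choices made for this example; a library routine would typically be used in practice.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))
```

Since binding affinity is often used to rank ligands for a single protein, ρ is informative even when the relation between predicted and true values is monotonic but not linear.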

4.1.1 Original KDeep - HTMD features

We tried to reproduce the network that KDeep[24] published in their supplementary information. It was not obvious how to deduce some of the parameters of the architecture, such as the size of some filters, because they were not reported. Using the number of parameters of the network, we managed to infer most of the shapes and sizes of the layers. Only two parameters remained unidentified: the sizes of the pooling layers, as they are independent of both output size and parameter count. For these unknown parameters we used the ones given by SqueezeNet[23], the architecture which KDeep is based on.

[Figure 4.2: Best CSAR HiQ Set 1 error per model type. Bar chart of RMSE per feature combination for the ResNet, KDeep and Original KDeep models; baseline[24]: 2.09.]

Table 4.1: Original KDeep results with HTMD features

                 RMSE   ρ      R
Training set     1.48   0.65   0.66
Validation set   1.55   0.69   0.71
CSAR HiQ set 1   2.22   0.69   0.64
CSAR HiQ set 2   1.72   0.63   0.69
CSAR 2014        0.85   0.79   0.85

This architecture was trained 50 times for 100 epochs each time using HTMD features only. The results appear in Table 4.1.

These results are far from what was reported in KDeep’s paper[24]. We have checked their prediction results for the validation set with PlayMolecule’s webserver, where KDeep is offered, and the results do match the RMSE of 1.27. This leads us to believe that some kind of preprocessing was applied to the data that was not explicitly mentioned in the paper. Because of the poor performance of this baseline, we decided not to train this specific architecture with any of the other features and representations.

[Figure 4.3: Best CSAR HiQ Set 2 error per model type. Bar chart of RMSE per feature combination and model type; baseline[24]: 1.92.]

4.1.2 Larger filter KDeep

We implemented a slight variation of the KDeep network, in which we use the filter shapes from SqueezeNet. We performed the 8 experiments mentioned above.

We trained the network at least 50 times per feature combination for 100 epochs. The results appear in Table 4.2.

We can observe that the best performance in the RMSE measure on all but the CSAR 2014 dataset was obtained by the combination of HTMD and Rosetta features, with the combination of HTMD, Rosetta and electronegativity features as a close second. This may indicate good interaction between Rosetta and HTMD features. The fact that the larger feature map size led to slightly worse results may also indicate the need to increase the width or depth of the architecture.

[Figure 4.4: Best CSAR 2014 error per model type. Bar chart of RMSE per feature combination and model type; baseline[24]: 1.75.]

It is interesting to notice that, when comparing HTMD features with the combination of Rosetta and electronegativity features, the performance was very similar with respect to RMSE, with the latter falling behind by less than 0.2 units, and outperforming HTMD on CSAR 2014. This last behavior occurred consistently in the experiments that did not include HTMD features.

In terms of correlations, it is also clear that the HTMD plus Rosetta features achieve the best or close to the best performance compared to the other combinations.

Regarding APBS features, the performance was very poor. The RMSE for the validation set was 1.82, far behind any of the other feature combinations. We did not explore this representation any further, as it consistently showed poor performance when combined with the rest of the feature maps.

[Figure 4.5: Validation set Pearson’s R per model type. Bar chart of Pearson’s R per feature combination and model type; baseline[24]: 0.82.]

In Figure 4.7 we can see the Pearson’s R coefficients per cluster. We observe less anticorrelation in the bottom group of clusters compared to the results from KDeep’s paper.

In general, the network seems to generalize well in terms of correlations, which stay within a very narrow range regardless of the dataset in question.

4.1.3 ResNet-101

We use the ResNet with 101 layers. We trained this network with the subsets of features mentioned above with 50 different random seeds for 100 epochs. The results appear in Table 4.3.

With this architecture, it is harder to find a best performing feature combination. While HTMD seems to perform best for CSAR HiQ Set 2, we also see very good performance for CSAR HiQ Set 1 with both Rosetta and HTMD plus electronegativity features. The combination that achieved the best performance on our validation set was HTMD, Rosetta and electronegativity features. Overall, both Rosetta and HTMD achieve good results on some of the datasets, while using fewer feature maps than the combination of the two.

Table 4.2: Results for the experiments when using a large filter KDeep network

                 HTMD                   Rosetta                EN
                 RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R
Training set     1.399  0.708  0.712    1.411  0.702  0.709    1.588  0.585  0.599
Validation set   1.484  0.731  0.745    1.508  0.700  0.725    1.595  0.680  0.699
CSAR HiQ set 1   2.058  0.747  0.724    2.076  0.739  0.759    2.367  0.582  0.568
CSAR HiQ set 2   1.592  0.710  0.758    1.780  0.698  0.699    2.041  0.507  0.553
CSAR 2014        1.342  0.775  0.823    0.952  0.751  0.767    0.819  0.802  0.850

                 HTMD+R                 HTMD+EN                R+EN                   HTMD+R+EN
                 RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R
Training set     1.259  0.768  0.770    1.383  0.715  0.718    1.394  0.705  0.710    1.291  0.754  0.759
Validation set   1.429  0.739  0.764    1.470  0.731  0.742    1.514  0.694  0.719    1.448  0.734  0.746
CSAR HiQ set 1   1.960  0.787  0.780    2.084  0.732  0.700    2.049  0.716  0.717    2.033  0.687  0.724
CSAR HiQ set 2   1.559  0.769  0.768    1.635  0.701  0.735    1.766  0.613  0.656    1.489  0.736  0.779
CSAR 2014        1.067  0.750  0.812    1.265  0.751  0.817    0.823  0.780  0.828    1.251  0.708  0.776

[Figure 4.6: Averaged test sets Pearson’s R per model type. Bar chart of Pearson’s R per feature combination and model type; baseline[24]: 0.66.]

In ResNet’s case the combination of Rosetta and electronegativity improved the prediction on the validation set, but worsened the predictions on all test datasets with respect to Rosetta features alone. This behavior is the opposite of the one observed with the large-filter KDeep architecture. Also, the difference in performance between HTMD plus Rosetta and HTMD plus Rosetta plus electronegativity was very small.

Here again, the performance of APBS features was very poor, with an RMSE on the validation set of 1.96. It is not included in the results table, as we did not continue exploring this feature map.

In Figure 4.8 we can see the Pearson’s R coefficients per cluster. We observe little difference between these correlations and the ones from the previous network. Performance is slightly worse in the bottom clusters, though one of the previously anticorrelated clusters is now uncorrelated. We can also observe that the per-cluster performance is not exactly the same as with the large-filter KDeep network.

4.1.4 Comparison of results with other networks

Here, we compare our results with those published in KDeep’s paper[24], which reports errors on the PDBBind core set and the CSAR HiQ Set 1, Set 2 and v.2014 sets for both their network and their reproduction of RF-Score[34].

We can see that both large-filter KDeep and ResNet-101 outperform the results reported in KDeep’s paper on all the test datasets considered. In fact, large-filter KDeep also achieves a performance similar to or slightly better than RF-Score on the test datasets. With regard to the validation set, the PDBBind core set, our networks did not manage to beat the results reported in KDeep’s paper. With ResNet-101 we achieved a performance similar to RF-Score, with a better result on the validation set and a worse result on the CSAR 2014 test dataset.

4.2 Representation results

In this section, we compare the different representations for Rosetta and electronegativity maps. To assess them, we used the best performing network of the ResNet-101 architecture. The results appear in Table 4.5.

We have observed that the distribution of the inverse exponential filter, when applied to our dataset with grids of side length 25, is very binary-like, with values usually either larger than 0.7 or lower than 0.1. This, together with the fact that the maps we apply this filter to are normalized to the range 0 to 1, makes the maps look very much like a 3D binary grid.

We observed that performance did not change substantially between the inverse exponential and Gaussian filters. Except for the results on CSAR HiQ Set 2, the values for RMSE and correlations do not differ much between these two filters.

Regarding interpolation methods, linear interpolation showed worse behavior than the other two when comparing the RMSE on the test datasets. There was no significant difference in performance between Gaussian and thin-plate interpolations, with Gaussian interpolation performing slightly better on the validation set and CSAR HiQ 1 but worse on CSAR HiQ 2.

Table 4.3: Results for the experiments when using a ResNet-101 network

                 HTMD                   Rosetta                EN
                 RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R
Training set     1.125  0.817  0.821    0.820  0.912  0.911    1.360  0.739  0.747
Validation set   1.342  0.775  0.790    1.409  0.739  0.762    1.457  0.733  0.753
CSAR HiQ set 1   1.976  0.793  0.748    1.931  0.748  0.775    2.285  0.666  0.630
CSAR HiQ set 2   1.524  0.736  0.771    1.637  0.684  0.718    1.855  0.604  0.666
CSAR 2014        1.258  0.735  0.765    1.051  0.772  0.808    1.003  0.752  0.796

                 HTMD+R                 HTMD+EN                R+EN                   HTMD+R+EN
                 RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R        RMSE   ρ      R
Training set     0.900  0.890  0.890    1.063  0.840  0.843    0.977  0.873  0.874    0.997  0.866  0.864
Validation set   1.332  0.769  0.788    1.358  0.772  0.786    1.360  0.757  0.779    1.322  0.777  0.793
CSAR HiQ set 1   2.008  0.768  0.742    1.951  0.807  0.773    1.996  0.759  0.750    2.001  0.799  0.761
CSAR HiQ set 2   1.560  0.718  0.751    1.585  0.733  0.751    1.679  0.697  0.707    1.655  0.707  0.720
CSAR 2014        1.425  0.724  0.779    1.259  0.727  0.779    1.121  0.742  0.769    1.236  0.712  0.784

[Figure 4.7: Pearson R per cluster for PDBBind Core set for large-filter KDeep.]

Table 4.4: Comparison between KDeep’s paper results and our networks’ results.

                   KDeep paper                  This work
                   KDeep         RF-Score       KDeep         Large-KDeep   ResNet-101
                   RMSE   R      RMSE   R       RMSE   R      RMSE   R      RMSE   R
PDBBind Core set   1.27   0.82   1.39   0.80    1.55   0.71   1.43   0.76   1.32   0.79
CSAR HiQ set 1     2.09   0.72   1.99   0.78    2.17   0.64   1.96   0.78   2.00   0.76
CSAR HiQ set 2     1.92   0.65   1.66   0.75    1.70   0.69   1.56   0.77   1.65   0.72
CSAR 2014          1.75   0.61   0.87   0.81    0.85   0.85   1.06   0.81   1.24   0.78

We saw a larger margin in performance when comparing RMSE between filters and interpolation methods. We observed a consistently lower prediction error for the filter methods on all datasets. When comparing correlation coefficients, except for linear interpolation, the performance was very similar.

4.3 Data augmentation results

We augmented the data using molecular dynamics simulations for 2 ns. From the 4463 starting complexes, we obtained a total of 48935 poses without errors. Around 20 of the original complexes failed the simulations consistently and had to be excluded.

The simulations took more than one week to complete when using around 10 Titan X GPUs.

We ran the best performing experiments once again on the augmented data for 15 epochs (equivalent to more than 100 epochs on the original data). We ran the experiments 25 times each. The results appear in Table 4.6. The validation error did not improve when compared to the same networks trained using the original dataset.
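The equivalence between epochs on the augmented and on the original data can be checked with a short calculation. The sketch below uses the pose and complex counts given above and, as a simplifying assumption, ignores that part of the data is held out for validation; the function name is hypothetical.

```python
def equivalent_epochs(augmented_poses, original_complexes, epochs_on_augmented):
    """Number of passes over the original data that see as many samples."""
    return epochs_on_augmented * augmented_poses / original_complexes


# 48935 poses from 4463 complexes is roughly an 11-fold increase,
# so 15 epochs on the augmented data exceed 100 epochs on the original data.
epochs = equivalent_epochs(48935, 4463, 15)
```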

[Figure 4.8: Pearson R per cluster for PDBBind Core set for ResNet-101.]

Table 4.5: Results for the experiments on different 3D representations of Rosetta + electronegativity maps using a ResNet-101 network

                 Filter                                  Interpolation
                 Inverse exponential   Gaussian          Linear            Gaussian          Thin plate
                 RMSE  ρ     R         RMSE  ρ     R     RMSE  ρ     R     RMSE  ρ     R     RMSE  ρ     R
Training set     0.97  0.87  0.87      1.10  0.82  0.83  1.12  0.82  0.82  1.21  0.79  0.79  1.26  0.77  0.77
Validation set   1.36  0.76  0.78      1.38  0.75  0.77  1.41  0.74  0.76  1.47  0.79  0.79  1.48  0.70  0.73
CSAR HiQ set 1   2.00  0.76  0.75      1.97  0.78  0.76  2.12  0.68  0.67  2.08  0.71  0.73  2.01  0.67  0.70
CSAR HiQ set 2   1.68  0.70  0.70      1.74  0.60  0.67  1.95  0.49  0.58  1.75  0.60  0.67  1.69  0.62  0.69
CSAR 2014        1.12  0.74  0.77      0.95  0.74  0.80  2.09  0.82  0.81  1.27  0.75  0.78  1.44  0.82  0.78

Table 4.6: Results for the best performing networks on MD augmented data.

                  Large-filter KDeep   ResNet-101
Dataset           HTMD + Rosetta       HTMD + Rosetta + EN
Validation RMSE   1.458                1.389

Chapter 5

Discussion

In this chapter we discuss what can be deduced from the results.

The goals of this project were:

1. Assessing the performance of energy-based features.

2. Comparing 3D voxelization procedures for pointwise spatial features.

3. Improving performance by tuning or changing the neural network architecture.

4. Exploring whether data preprocessing and augmentation can improve predictive power.

On the first point, we have observed that it is definitely possible to use energy-based features like the Rosetta force field used here. The performance of Rosetta features with ResNet is close to the performance of the previous state of the art, RF-Score. We also observed an improvement when they are combined with HTMD or electronegativity maps, which indicates that they could be used as an extension of current models based on molecular descriptors. We have also observed that correlation coefficients improved when these features were added to HTMD features, which may reflect the diversity of information that these new maps provide. We also tested APBS features, but their performance was very poor. Considering the significance of electrostatics in protein-ligand binding, we believe that the drop in performance may be caused by the continuous nature of the maps, as all the other maps had a more stepwise shape due to the nature of the filters.

On the second point, we have tested five different voxelization methods and found no significant improvement over the state of the art. It was not surprising to see that linear interpolation was the worst performing overall, considering that energy is distributed as an inverse polynomial through space, but it was interesting to see that its performance was still within acceptable margins when compared to the behavior of all methods shown in this project.

On the third point, we did not manage to reproduce the original results reported in KDeep’s paper. We built our own reproduction based on the diagrams released in KDeep’s supplementary information. Relative to that reproduction, we improved the prediction RMSE with the other architectures by a margin of around 0.25, but our best result was still 0.06 units away from the reported KDeep result. We have managed to corroborate KDeep’s result by using the public webserver they offer. This suggests that there is some difference in the preprocessing or the neural network that we were not able to account for. Regardless of this, we have shown that our best model based on KDeep’s architecture generalizes better when considering not only the validation set but also the other three datasets. Even on the validation set, we have observed less anticorrelation, which may indicate that the additional Rosetta features add some new information to the HTMD features.

Finally, we have developed a pipeline to preprocess the protein-ligand complexes in a standard way. This pipeline consists mainly of relaxing the ligand inside the protein to lower the energy of the complex, thus reducing the biases that the data may have due to the inherent characteristics of its extraction method (i.e. X-ray crystallography). The results show better generalization for all combinations of features and neural network architectures, which leads us to think that this minimization may be successful in making the data harder to overfit to. We also augmented the dataset tenfold by simulating the movement of the protein for 2 ns, thus enlarging the dataset from 4.4k to 48k poses. The simulations took more than one week to complete when using around 10 Titan X GPUs. Because of this, we have stored both the resulting poses and the checkpoints, so that the simulations can be continued if there is any need to increase the dataset. We have seen that augmenting the dataset by molecular dynamics simulations did not directly improve the performance, which may be because the simulations were too short. It could also be the case that the Rosetta relaxation removes the effect of the simulations when optimizing the position of the atoms around the ligand. Because we only ran the experiments 25 times, it is also possible that we were unlucky with the random seeds used.

If we were to continue this study, some ideas for next steps would be the following:

• Regarding 3D voxelization, it would be interesting to let the neural network learn the filter. This could be achieved by using deconvolutional layers, for example.

• Regarding data augmentation, it would be interesting to generate fake complexes to give the neural network examples of “bad binding affinity” for all the proteins. We noticed that some protein clusters are overrepresented, and this approach could help remove bias in the dataset.

• Because the per cluster correlations changed when comparing large-filter KDeep and ResNet, it would be a possibility to train multiplenetworks and ensemble them to slightly improve the performance.

• It would probably be beneficial to add structures of different nature:different types of proteins, different extraction methods (i.e. not onlystructures obtained with X-Ray crystallography) and also repeatedcomplexes from different sources.

• The search for the neural network architecture could be left to some of the recently popularized AutoML techniques. This could find a custom structure that better fits the nature of the problem treated here.
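The ensembling idea in the list above can be illustrated with a minimal sketch. It assumes that each trained network produces one predicted affinity per complex, in the same complex order, and that the ensemble is a plain average. The helper names `pearson` and `ensemble_mean` are hypothetical, and the numbers in the usage example are toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ensemble_mean(predictions_per_model):
    """Average the predictions of several models, complex by complex.

    predictions_per_model: list of lists, one inner list per model,
    all given in the same complex order.
    """
    n = len(predictions_per_model[0])
    if any(len(p) != n for p in predictions_per_model):
        raise ValueError("all models must predict the same complexes")
    return [sum(column) / len(column) for column in zip(*predictions_per_model)]


if __name__ == "__main__":
    # Toy values for illustration only.
    measured = [4.0, 6.5, 7.2, 5.1]
    model_a = [4.5, 6.0, 6.8, 5.9]
    model_b = [3.8, 7.1, 6.9, 4.7]
    combined = ensemble_mean([model_a, model_b])
    print(pearson(measured, model_a), pearson(measured, model_b))
    print(pearson(measured, combined))
```

Averaging is the simplest choice. Weighting the models by their validation performance would be an alternative, at the cost of one more quantity to tune.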

Chapter 6

Conclusions

In this work we have built on the preceding results of KDeep [24] in order to improve the prediction of binding affinity in protein-ligand complexes. We have approached this along four different routes: modification of the features, modification of the neural network architecture, data preprocessing, and augmentation of the dataset.

Our work on the feature maps has shown that it is possible to get decent, albeit not the best, performance out of energy-based feature maps such as our maps based on the Rosetta all-atom force field. We have also explored different ways of representing these energy features in 3D space by applying different filters and interpolation methods, and shown that the method used by the state of the art (the inverse exponential filter) performed best among the ones we tested.
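For readers unfamiliar with this kind of filter, the following is a minimal sketch of a voxelization step. It assumes that the inverse exponential filter has the form used in the KDeep paper [24], n(r) = 1 - exp(-(r_vdw / r)^12), where r is the distance between an atom and a voxel centre and r_vdw is the van der Waals radius of the atom. The function names, the grid layout, and the choice of keeping the maximum contribution per voxel are assumptions of this sketch, not a description of the implementation used in this work.

```python
import math


def occupancy(distance, vdw_radius):
    """Contribution of one atom to a voxel centre at the given distance,
    using n(r) = 1 - exp(-(r_vdw / r)^12)."""
    if distance < 1e-6:
        return 1.0  # limit of the formula as r -> 0; avoids division by zero
    return 1.0 - math.exp(-((vdw_radius / distance) ** 12))


def voxelize(atoms, grid_size=4, spacing=1.0, origin=(0.0, 0.0, 0.0)):
    """Map atoms [(x, y, z, vdw_radius), ...] to a cubic occupancy grid.

    Returns a nested list grid[i][j][k]. Each voxel keeps the largest
    single-atom contribution (an assumption of this sketch).
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)]
            for _ in range(grid_size)]
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Coordinates of the centre of voxel (i, j, k).
                centre = (origin[0] + (i + 0.5) * spacing,
                          origin[1] + (j + 0.5) * spacing,
                          origin[2] + (k + 0.5) * spacing)
                for x, y, z, radius in atoms:
                    r = math.dist(centre, (x, y, z))
                    grid[i][j][k] = max(grid[i][j][k], occupancy(r, radius))
    return grid


if __name__ == "__main__":
    # One hypothetical atom with a 1.7 angstrom radius in a 2x2x2 grid.
    grid = voxelize([(0.5, 0.5, 0.5, 1.7)], grid_size=2)
    print(grid[0][0][0], grid[1][1][1])
```

The steep exponent makes the contribution close to 1 inside the van der Waals radius and lets it fall off quickly outside it, which gives a smooth but compact representation of each atom.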

Regarding the neural network architecture, we were not able to reproduce the results of the KDeep paper. We tried to reproduce the original KDeep network, but the results were far from the reported ones, which indicates that we may be missing some step in our data preprocessing or have made some mistake in the implementation of the network. We developed two other networks: a variant of KDeep with a larger filter in the first convolution, and a ResNet-101 architecture. We have been unable to achieve a better result on the validation set, PDBBind’s core set, but we managed to improve on all the test datasets. We have also obtained Pearson correlation coefficients that are much more consistent between validation and test data, indicating better generalization. Protein-wise, we have obtained similar positive correlations and fewer negatively correlated proteins than in KDeep’s results. Because binding affinity is used as a ranking measure to compare how well ligands bind to a single protein, this result indicates that our networks perform better as ranking tools.
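The protein-wise comparison can be sketched as follows. This is a minimal illustration, assuming that every complex is labelled with the identifier of its protein (or protein cluster) and that the correlation is computed separately for each identifier. The function names, the record layout, and the `min_ligands` threshold are hypothetical, and the records in the usage example are toy values.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def per_protein_correlation(records, min_ligands=3):
    """Group (protein_id, measured, predicted) records by protein and
    return {protein_id: Pearson R}. Proteins with fewer than min_ligands
    complexes, or with constant values, are skipped."""
    groups = defaultdict(lambda: ([], []))
    for protein_id, measured, predicted in records:
        groups[protein_id][0].append(measured)
        groups[protein_id][1].append(predicted)
    result = {}
    for protein_id, (measured, predicted) in groups.items():
        if len(measured) < min_ligands:
            continue
        try:
            result[protein_id] = pearson(measured, predicted)
        except ValueError:
            continue
    return result


def count_negative(correlations):
    """Number of proteins whose ligands are ranked in the wrong direction."""
    return sum(1 for r in correlations.values() if r < 0)


if __name__ == "__main__":
    # Toy records for illustration only.
    records = [("A", 1.0, 1.1), ("A", 2.0, 2.0), ("A", 3.0, 3.2),
               ("B", 1.0, 3.0), ("B", 2.0, 2.0), ("B", 3.0, 1.0),
               ("C", 1.0, 1.0), ("C", 2.0, 2.0)]
    correlations = per_protein_correlation(records)
    print(correlations, count_negative(correlations))
```

A rank-based coefficient such as Spearman's would match the ranking use case more directly. Pearson is used here because it is the coefficient reported in the text.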

As for data preprocessing, we have introduced a process of relaxation of the ligand inside the protein using the Rosetta force field. We have seen improved results in some of the test datasets when comparing KDeep’s results with our reproduction of their network, indicating that the relaxation may improve the spatial structure of the input complexes. This is reinforced by the later results with large-filter KDeep and ResNet-101, which obtained very balanced correlation coefficients even when using only the HTMD features.

Finally, we have augmented the dataset tenfold by applying molecular dynamics simulations.

Bibliography

[1] Alexander Fleming discovery and development of penicillin - landmark. http://www.acs.org/content/acs/en/education/whatischemistry/landmarks/flemingpenicillin.html.

[2] scipy.interpolate.rbf. http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.Rbf.html.

[3] mendeleev – a python resource for properties of chemical elements, ions and isotopes, ver. 0.4.5, 2014. https://bitbucket.org/lukaszmentel/mendeleev.

[4] Rebecca F. Alford, Andrew Leaver-Fay, Jeliazko R. Jeliazkov, Matthew J. O’Meara, Frank P. DiMaio, Hahnbeom Park, Maxim V. Shapovalov, P. Douglas Renfrew, Vikram K. Mulligan, Kalli Kappel, Jason W. Labonte, Michael S. Pacella, Richard Bonneau, Philip Bradley, Roland L. Dunbrack, Rhiju Das, David Baker, Brian Kuhlman, Tanja Kortemme, and Jeffrey J. Gray. The rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation, 13(6):3031–3048, 2017. PMID: 28430426.

[5] B. R. Brooks, C. L. Brooks III, A. D. Mackerell Jr., L. Nilsson, R. J. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. R. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. W. Pastor, C. B. Post, J. Z. Pu, M. Schaefer, B. Tidor, R. M. Venable, H. L. Woodcock, X. Wu, W. Yang, D. M. York, and M. Karplus. Charmm: The biomolecular simulation program. Journal of Computational Chemistry, 30(10):1545–1614, 2009.

[6] Zixuan Cang and Guo-Wei Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLOS Computational Biology, 13(7):e1005690, jul 2017.

[7] Heather A. Carlson, Richard D. Smith, Kelly L. Damm-Ganamet, Jeanne A. Stuckey, Aqeel Ahmed, Maire A. Convery, Donald O. Somers, Michael Kranz, Patricia A. Elkins, Guanglei Cui, Catherine E. Peishoff, Millard H. Lambert, and James B. Dunbar. Csar 2014: A benchmark exercise using unpublished data from pharma. Journal of Chemical Information and Modeling, 56(6):1063–1077, 2016. PMID: 27149958.

[8] Henry F Chambers and Frank R DeLeo. Waves of resistance: Staphylococcus aureus in the antibiotic era. Nature Reviews Microbiology, 7(9):629, 2009.

[9] Wendy D. Cornell, Piotr Cieplak, Christopher I. Bayly, Ian R. Gould, Kenneth M. Merz, David M. Ferguson, David C. Spellmeyer, Thomas Fox, James W. Caldwell, and Peter A. Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. Journal of the American Chemical Society, 117(19):5179–5197, 1995.

[10] Kevin Cowtan. Phase Problem in X-ray Crystallography, and Its Solution. American Cancer Society, 2003.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[12] Joseph A. DiMasi, Henry G. Grabowski, and Ronald W. Hansen. Innovation in the pharmaceutical industry: New estimates of r&d costs. Journal of Health Economics, 47:20–33, 2016.

[13] S. Doerr, M. J. Harvey, Frank Noe, and G. De Fabritiis. Htmd: High-throughput molecular dynamics for molecular discovery. Journal of Chemical Theory and Computation, 12(4):1845–1852, 2016. PMID: 26949976.

[14] Kurt L. M. Drew, Hakim Baiman, Prashanna Khwaounjoo, Bo Yu, and Johannes Reynisson. Size estimation of chemical space: how big is it? Journal of Pharmacy and Pharmacology, 64(4):490–495, 2012.

[15] James B. Dunbar, Richard D. Smith, Chao-Yie Yang, Peter Man-Un Ung, Katrina W. Lexa, Nickolay A. Khazanov, Jeanne A. Stuckey, Shaomeng Wang, and Heather A. Carlson. Csar benchmark exercise of 2010: Selection of the protein–ligand complexes. Journal of Chemical Information and Modeling, 51(9):2036–2046, 2011. PMID: 21728306.

[16] Stephan Ehrlich, Andreas H. Goller, and Stefan Grimme. Towards full quantum-mechanics-based protein–ligand binding affinities. ChemPhysChem, 18(8):898–905, 2017.

[17] F. Fogolari, A. Brigo, and H. Molinari. The poisson–boltzmann equation for biomolecular electrostatics: a tool for structural biology. Journal of Molecular Recognition, 15(6):377–392, 2002.

[18] Cunliang Geng, Li C. Xue, Jorge Roel-Touris, and Alexandre M. J. J. Bonvin. Finding the δδg spot: Are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it? Wiley Interdisciplinary Reviews: Computational Molecular Science, 0(0):e1410.

[19] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.

[20] Md Mahmudulla Hassan, Daniel Castaneda Mogollon, Olac Fuentes, and Suman Sirimulla. DLSCORE: A Deep Learning Model for Predicting Protein-Ligand Binding Affinities. 4 2018.

[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. Lecture Notes in Computer Science, page 630–645, 2016.

[23] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.

[24] Jose Jimenez, Miha Skalic, Gerard Martínez-Rosell, and Gianni De Fabritiis. Kdeep: Protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. Journal of Chemical Information and Modeling, 58(2):287–296, 2018. PMID: 29309725.

[25] Elizabeth Jurrus, Dave Engel, Keith Star, Kyle Monson, Juan Brandi, Lisa E. Felberg, David H. Brookes, Leighton Wilson, Jiahui Chen, Karina Liles, Minju Chun, Peter Li, David W. Gohara, Todd Dolinsky, Robert Konecny, David R. Koes, Jens Erik Nielsen, Teresa Head-Gordon, Weihua Geng, Robert Krasny, Guo-Wei Wei, Michael J. Holst, J. Andrew McCammon, and Nathan A. Baker. Improvements to the apbs biomolecular solvation software suite. Protein Science, 27(1):112–128, 2018.

[26] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.

[27] Eili Y. Klein, Thomas P. Van Boeckel, Elena M. Martinez, Suraj Pant, Sumanth Gandra, Simon A. Levin, Herman Goossens, and Ramanan Laxminarayan. Global increase and geographic convergence in antibiotic consumption between 2000 and 2015. Proceedings of the National Academy of Sciences, 115(15):E3463–E3470, 2018.

[28] Maria Kontoyianni. Docking and Virtual Screening in Drug Discovery, pages 255–266. Springer New York, New York, NY, 2017.

[29] S. Korolev, A. Safiullin, M. Belyaev, and Y. Dodonova. Residual and plain convolutional neural networks for 3d brain mri classification. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pages 835–838, April 2017.

[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc.

[31] Indra Kundu, Goutam Paul, and Raja Banerjee. A machine learning approach towards the prediction of protein–ligand binding affinity based on fundamental molecular properties. RSC Adv., 8:12127–12137, 2018.

[32] Morgan Lawrenz, Jeff Wereszczynski, Juan Manuel Ortiz-Sanchez, Sara E. Nichols, and J. Andrew McCammon. Thermodynamic integration to predict host-guest binding affinities. Journal of Computer-Aided Molecular Design, 26(5):569–576, May 2012.

[33] Andrew Leaver-Fay, Michael Tyka, Steven M. Lewis, Oliver F. Lange, James Thompson, Ron Jacak, Kristian W. Kaufman, P. Douglas Renfrew, Colin A. Smith, Will Sheffler, Ian W. Davis, Seth Cooper, Adrien Treuille, Daniel J. Mandell, Florian Richter, Yih-En Andrew Ban, Sarel J. Fleishman, Jacob E. Corn, David E. Kim, Sergey Lyskov, Monica Berrondo, Stuart Mentzer, Zoran Popovic, James J. Havranek, John Karanicolas, Rhiju Das, Jens Meiler, Tanja Kortemme, Jeffrey J. Gray, Brian Kuhlman, David Baker, and Philip Bradley. Chapter nineteen - rosetta3: An object-oriented software suite for the simulation and design of macromolecules. In Michael L. Johnson and Ludwig Brand, editors, Computer Methods, Part C, volume 487 of Methods in Enzymology, pages 545–574. Academic Press, 2011.

[34] Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, and Pedro J. Ballester. Improving autodock vina using random forest: The growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Molecular Informatics, 34(2-3):115–126, 2015.

[35] Zhihai Liu, Minyi Su, Li Han, Jie Liu, Qifan Yang, Yan Li, and Renxiao Wang. Forging the basis for developing protein–ligand interaction scoring functions. Accounts of Chemical Research, 50(2):302–309, 2017. PMID: 28182403.

[36] Siddharth Mahendran, Haider Ali, and Rene Vidal. Convolutional networks for object category and 3d pose estimation from 2d images. In Laura Leal-Taixe and Stefan Roth, editors, Computer Vision – ECCV 2018 Workshops, pages 698–715, Cham, 2019. Springer International Publishing.

[37] Peter Mueller. Crystal structure refinement - hydrogen atoms.

[38] Lucas Gregorio Nivon, Rocco Moretti, and David Baker. A pareto-optimal refinement method for protein design scaffolds. PLOS ONE, 8(4):1–5, 04 2013.

[39] Mats H. M. Olsson, Chresten R. Søndergaard, Michal Rostkowski, and Jan H. Jensen. Propka3: Consistent treatment of internal and surface residues in empirical pka predictions. Journal of Chemical Theory and Computation, 7(2):525–537, 2011. PMID: 26596171.

[40] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Muller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, Jan 2012.

[41] Piar Ali Shar, Weiyang Tao, Shuo Gao, Chao Huang, Bohui Li, Wenjuan Zhang, Mohamed Shahen, Chunli Zheng, Yaofei Bai, and Yonghua Wang. Pred-binding: large-scale protein–ligand binding affinity prediction. Journal of Enzyme Inhibition and Medicinal Chemistry, 31(6):1443–1450, 2016. PMID: 26888050.

[42] Yue Shi, Zhen Xia, Jiajing Zhang, Robert Best, Chuanjie Wu, Jay W. Ponder, and Pengyu Ren. Polarizable atomic multipole-based amoeba force field for proteins. Journal of Chemical Theory and Computation, 9(9):4046–4063, 2013. PMID: 24163642.

[43] Oleg Trott and Arthur J. Olson. Autodock vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.

[44] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chem. Sci., 9:513–530, 2018.

[45] wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 47(D1):D520–D528, 10 2018.

[46] A. Yaseen, W. A. Abbasi, and F. ul Amir Afsar Minhas. Protein binding affinity prediction using support vector regression and interfecial features. In 2018 15th International Bhurban Conference on Applied Sciences and Technology (IBCAST), pages 194–198, Jan 2018.

Appendices

Appendix A

Dataset complexes

A.1 Training data: PDBBind 2018 Refined set

10gs 184l 185l 186l 187l 188l 1a1e 1a28 1a4k 1a4r 1a4w 1a691a94 1a99 1a9m 1a9q 1aaq 1add 1adl 1ado 1afk 1afl 1ai4 1ai51ai7 1aid 1aj7 1ajn 1ajp 1ajq 1ajv 1ajx 1alw 1amk 1amw 1apv1atl 1atr 1avn 1ax0 1azm 1b05 1b0h 1b1h 1b2h 1b32 1b38 1b3f1b3g 1b3h 1b3l 1b40 1b46 1b4h 1b4z 1b51 1b52 1b55 1b57 1b581b5h 1b5i 1b5j 1b6h 1b6j 1b6k 1b6l 1b7h 1b8n 1b8o 1b8y 1b9j1bai 1bcd 1bdq 1bgq 1bhf 1bhx 1bju 1bjv 1bm7 1bma 1bn1 1bn31bn4 1bnn 1bnq 1bnt 1bnu 1bnv 1bnw 1bp0 1bq4 1br6 1bty 1bv71bv9 1bwa 1bwb 1bxo 1bxq 1bxr 1bzj 1bzy 1c1r 1c1u 1c1v 1c3x1c4u 1c5c 1c5n 1c5o 1c5p 1c5q 1c5s 1c5t 1c5x 1c5y 1c70 1c831c84 1c86 1c87 1c88 1cbx 1ceb 1cet 1cgl 1ciz 1cnw 1cnx 1cny1cps 1ctt 1ctu 1d09 1d2e 1d3d 1d3p 1d4h 1d4i 1d4j 1d4k 1d4l1d4p 1d4y 1d6v 1d6w 1d7i 1d7j 1d9i 1dar 1det 1df8 1dgm 1dhi1dhj 1dif 1dl7 1dmp 1dqn 1drj 1drk 1drv 1dud 1duv 1dy4 1dzk1e1v 1e1x 1e2k 1e2l 1e3g 1e3v 1e4h 1e5j 1e6q 1e6s 1eb2 1ebw1ebz 1ec0 1ec1 1ec2 1ec3 1ec9 1ecq 1ecv 1efy 1egh 1ejn 1ela1elb 1elc 1eld 1ele 1elr 1enu 1eoc 1epo 1erb 1ew8 1ew9 1ex81ez9 1ezq 1f0r 1f0s 1f0t 1f0u 1f3e 1f4e 1f4f 1f4g 1f4x 1f571f5k 1f5l 1f73 1f74 1f8b 1f8c 1f8d 1f8e 1fao 1fch 1fcx 1fcy1fcz 1fd0 1fh7 1fh8 1fh9 1fhd 1fiv 1fjs 1fkb 1fkf 1fkg 1fkh1fki 1fkn 1fkw 1fl3 1flr 1fm9 1fo0 1fpc 1fq5 1ft7 1ftm 1fv01fzj 1fzk 1fzm 1fzo 1fzq 1g1d 1g2l 1g2o 1g30 1g32 1g35 1g361g3d 1g3e 1g45 1g46 1g48 1g4o 1g52 1g53 1g54 1g74 1g7f 1g7g1g7q 1g7v 1g85 1g98 1gaf 1gai 1gar 1gfy 1ghv 1ghw 1ghy 1ghz1gi1 1gi4 1gi7 1gj6 1gjc 1gnm 1gnn 1gno 1grp 1gvw 1gvx 1gwv1gx8 1gyx 1gyy 1h0a 1h1s 1h2k 1h2t 1h46 1h4w 1h5v 1h6h 1hbv1hdq 1hee 1hfs 1hi3 1hi4 1hi5 1hih 1hii 1hk4 1hlk 1hmr 1hms1hmt 1hn4 1hos 1hp5 1hpo 1hps 1hpv 1hpx 1hsh 1hsl 1hvh 1hvi1hvj 1hvk 1hvl 1hvr 1hvs 1hwr 1hxb 1hxw 1hyo 1i1e 1i2s 1i371i5r 1i7z 1i9n 1i9p 1ie9 1if7 1if8 1igb 1igj 1ii5 1iih 1iiq1ik4 1ikt 1ivp 1iy7 1izh 1izi 1j01 1j14 1j16 1j17 1j36 1j371j4r 1jak 1jao 1jaq 1jcx 1jet 1jeu 1jev 1jgl 1jlr 1jmf 1jmg1jn4 1jq8 1jqy 1jsv 1jvu 1jyq 1jys 1jzs 1k1j 1k1l 1k1m 1k1n1k1o 1k1y 1k21 1k22 1k27 1k4g 1k4h 1k6c 1k6p 1k6t 
1k6v 1k9s1kav 1kc7 1kdk 1kel 1kjr 1km3 1kmy 1koj 1kpm 1ksn 1kug 1kui1kuk 1kv1 1kv5 1kyv 1kzk 1kzn 1l83 1l8g 1laf 1lag 1lah 1lan1lbf 1lbk 1lcp 1lee 1lf2 1lgt 1lgw 1lhu 1li2 1li3 1li6 1lke1lkk 1lkl 1lnm 1loq 1lpk 1lpz 1lrh 1lst 1lvu 1lyb 1lyx 1lzq1m0b 1m0n 1m0o 1m0q 1m1b 1m2p 1m2q 1m2r 1m2x 1m48 1m4h 1m5w1m7d 1m7i 1m7y 1m83 1mai 1mes 1met 1mfa 1mfd 1mfi 1mh5 1mjj1mmq 1mmr 1moq 1mq5 1mrn 1mrs 1mrw 1mrx 1msm 1msn 1mtr 1mu61mu8 1mue 1my4 1n0s 1n1m 1n3i 1n46 1n4h 1n4k 1n51 1n5r 1ndv1ndw 1ndy 1ndz 1nf8 1nfu 1nfw 1nfx 1nfy 1nh0 1nhu 1nhz 1nja1njc 1njd 1nje 1njs 1nki 1nl9 1nli 1nm6 1nny 1no6 1np0 1nq71nt1 1nvr 1nvs 1nw4 1nw5 1nw7 1nwl 1nz7 1o0f 1o0m 1o0n 1o1s1o2h 1o2j 1o2n 1o2o 1o2q 1o2r 1o2w 1o2z 1o30 1o33 1o35 1o361o38 1o3d 1o3i 1o3j 1o3l 1o5a 1o5c 1o5e 1o5g 1o5r 1o7o 1o86

1oar 1oau 1oba 1ocq 1od8 1odi 1odj 1oe8 1ogd 1ogg 1ogx 1ogz1ohr 1oif 1okl 1ols 1olu 1olx 1om1 1ony 1onz 1ork 1os0 1os51oss 1owe 1oxr 1oyq 1oz0 1p19 1p1o 1p57 1pa9 1pb8 1pb9 1pbq1pdz 1pfu 1pgp 1phw 1pkx 1pme 1pot 1ppc 1pph 1ppi 1ppk 1ppl1ppm 1pr5 1pro 1pvn 1px4 1pxo 1pxp 1pyn 1pz5 1pzi 1pzp 1q1g1q54 1q5k 1q65 1q72 1q7a 1q84 1q8w 1q91 1qan 1qaw 1qb1 1qb61qb9 1qbn 1qbo 1qbq 1qbr 1qbs 1qbt 1qbu 1qbv 1qf0 1qf2 1qft1qhc 1qin 1qji 1qk3 1qk4 1qka 1qkb 1ql7 1ql9 1qxk 1qxl 1qy11qy2 1qyg 1r0p 1r1h 1r1j 1r9l 1rbp 1rd4 1rjk 1rmz 1rnm 1rnt1ro6 1rp7 1rpf 1rpj 1rql 1rr6 1rtf 1s19 1s39 1s5z 1s89 1sb11sbg 1sdt 1sdu 1sdv 1sgu 1sh9 1siv 1sl3 1sld 1sln 1sqo 1sqt1sr7 1srg 1ssq 1stc 1str 1sv3 1sw2 1swg 1swr 1syh 1szd 1t311t32 1t4v 1t5f 1t7d 1t7j 1ta6 1tcx 1td7 1thz 1tjp 1tkb 1tlp1tmn 1tng 1tnh 1tni 1tom 1tpw 1tq4 1trd 1tsy 1ttm 1tx7 1txr1u0g 1u1w 1u33 1u71 1ua4 1ucn 1ugx 1uho 1ui0 1uj5 1uml 1uou1upf 1ur9 1usi 1usk 1usn 1utj 1utl 1utm 1utn 1uv6 1uvt 1uw61uwf 1uwt 1uwu 1uz1 1uz4 1uz8 1v0k 1v0l 1v11 1v16 1v1j 1v1m1v2j 1v2k 1v2l 1v2n 1v2o 1v2r 1v2s 1v2t 1v2u 1v2w 1v48 1v7a1vfn 1vyf 1vyg 1vzq 1w0z 1w11 1w13 1w3j 1w3k 1w3l 1w4p 1w4q1w5v 1w5w 1w5x 1w5y 1w7g 1w96 1w9u 1w9v 1wc1 1wcq 1wdn 1wht1wm1 1wn6 1ws1 1ws4 1wuq 1wur 1wvj 1x1z 1x38 1x39 1x8d 1x8j1x8r 1x8t 1xap 1xbo 1xd0 1xff 1xgi 1xh4 1xh5 1xh9 1xhy 1xjd1xk5 1xk9 1xka 1xkk 1xow 1xpz 1xq0 1xr9 1xt8 1xug 1xws 1y0l1y1z 1y20 1y3n 1y3p 1y3v 1y3x 1y6q 1yc4 1yda 1ydb 1ydd 1ydk1yds 1yei 1yej 1yet 1yfz 1yp9 1ype 1ypg 1ypj 1yq7 1yqj 1yqy1yvm 1z1h 1z4o 1z6s 1z71 1z9y 1zc9 1zdp 1zea 1zfq 1zge 1zgi1zhy 1zoe 1zog 1zoh 1zp8 1zpa 1zs0 1zsf 1zvx 2a14 2a4m 2a5b2a5c 2a5s 2a8g 2aac 2afw 2afx 2aj8 2am4 2amt 2ans 2aoc 2aod2aoe 2aog 2aqu 2arm 2avm 2avo 2avq 2avs 2ax9 2ayr 2azr 2b072b1g 2b1i 2b4l 2b7d 2b9a 2baj 2bak 2bal 2bes 2bet 2bfq 2bfr2bmk 2bo4 2boh 2boj 2bok 2bpv 2bpy 2bq7 2bqv 2brm 2bt9 2buv2bvd 2bvr 2bvs 2byr 2bys 2bz6 2bza 2c1p 2c3l 2c80 2c92 2c942c97 2ca8 2cbj 2cbu 2cbz 2cc7 2ccb 2ccc 2ce9 2cej 2cen 2ceq2cer 2ces 2cex 2cf8 2cf9 2cgf 2cgr 2cht 2cle 2clh 
2cli 2clk2cn0 2csn 2ctc 2d0k 2d1n 2d1o 2d3u 2d3z 2doo 2drc 2dri 2dw72e1w 2e27 2e2p 2e2r 2e7f 2e91 2e92 2e94 2e9u 2epn 2erz 2euk2evl 2ewa 2ewb 2ews 2exm 2ez7 2f1g 2f2h 2f34 2f35 2f6t 2f7i2f7o 2f7p 2f80 2f81 2f8g 2f94 2f9k 2fdp 2fgu 2fgv 2fle 2flr2fmb 2fpz 2fqo 2fqt 2fqw 2fqx 2fqy 2fu8 2fw6 2fx6 2fxu 2fxv2fzc 2fzg 2fzk 2g5u 2g94 2gh9 2gj5 2gkl 2gl0 2glp 2gss 2gst2gsu 2gv6 2gv7 2gvj 2gvv 2gyi 2gz2 2gzl 2h15 2h21 2h3e 2h4g2h4k 2h4n 2h6b 2h6t 2ha2 2ha3 2ha6 2hah 2haw 2hb3 2hhn 2hjb2hkf 2hl4 2hmu 2hmv 2hnc 2hnx 2hoc 2hs1 2hu6 2hxm 2hzl 2hzy2i0a 2i19 2i2c 2i3h 2i3i 2i4d 2i4j 2i4u 2i4v 2i4w 2i4x 2i4z2i6b 2i80 2idw 2ihj 2ihq 2iko 2isw 2iuz 2izl 2j27 2j2u 2j342j47 2j4g 2j4i 2j62 2j75 2j77 2j79 2j7b 2j7d 2j7e 2j7f 2j7g2j94 2j95 2jdm 2jdp 2jds 2jdu 2jew 2jf4 2jfz 2jg0 2jgs 2jh02jh5 2jh6 2jiw 2jjb 2jke 2jkh 2jkp 2jxr 2mas 2nmx 2nmz 2nn12nn7 2nnd 2nsj 2nsl 2nt7 2nta 2o0u 2o4j 2o4k 2o4l 2o4n 2o4p2o4r 2o4s 2o4z 2o8h 2oag 2oax 2oc2 2ogy 2oi0 2oi2 2oiq 2ojg2ojj 2olb 2ole 2on6 2ot1 2ovv 2ovy 2oxd 2oxn 2oxx 2oxy 2oym2p09 2p16 2p2a 2p3a 2p3b 2p3c 2p3i 2p4j 2p4s 2p53 2p7a 2p7g2p7z 2p95 2pbw 2pcp 2pgz 2pk5 2pk6 2pou 2pov 2pow 2pq9 2pqb2pqc 2pql 2pqz 2psu 2psv 2ptz 2pu1 2pu2 2pv1 2pvh 2pvj 2pvk2pvl 2pvm 2pvu 2pwc 2pwd 2pwg 2pwr 2py4 2pym 2pyn 2pyy 2q1q2q2a 2q38 2q54 2q55 2q5k 2q63 2q64 2q6f 2q7q 2q88 2q89 2q8h2q8m 2q8z 2qbs 2qbu 2qbw 2qci 2qd6 2qd7 2qd8 2qdt 2qg0 2qg22qhy 2qhz 2qi0 2qi1 2qi3 2qi4 2qi5 2qi6 2qi7 2qm9 2qmg 2qnn2qnp 2qpq 2qpu 2qrk 2qrl 2qta 2qtg 2qtn 2qtt 2qu6 2qw1 2qwb2qwc 2qwd 2qwe 2qwf 2qzr 2r0h 2r0z 2r1y 2r23 2r2m 2r2w 2r382r3t 2r3w 2r43 2r58 2r59 2r5a 2r5p 2r75 2r9x 2ra0 2ra6 2rcb2rcn 2rd6 2reg 2rfh 2ri9 2rin 2rio 2rk8 2rka 2rkd 2rke 2rkf2rkg 2rkm 2sim 2std 2tmn 2tpi 2usn 2uwd 2uwl 2uwo 2uwp 2uxi2uxz 2uy0 2uy3 2uy4 2uy5 2uyn 2uyq 2uz9 2v25 2v2c 2v2h 2v2q2v2v 2v3d 2v3u 2v54 2v57 2v58 2v59 2v77 2v88 2v8w 2v95 2vb82vba 2vc9 2ves 2vfk 2vh0 2vh6 2vhj 2vj8 2vjx 2vk2 2vk6 2vl42vmc 2vmd 2vmf 2vnp 2vnt 2vo4 2vo5 2vot 2vpe 2vpn 2vpo 2vqt2vrj 2vsl 2vt3 2vuk 2vvc 2vvs 2vvu 2vvv 
2vw1 2vw2 2vwc 2vwl2vwm 2vwn 2vwo 2vxn 2vyt 2vzr 2w08 2w26 2w47 2w5g 2w67 2w8j2w8w 2w8y 2w9h 2wb5 2wc3 2wc4 2we3 2web 2wec 2wed 2weh 2wej2weo 2weq 2wf5 2wgj 2whp 2wjg 2wk6 2wky 2wkz 2wl0 2wly 2wlz2wm0 2wnj 2wor 2wos 2wq5 2wr8 2wuf 2wvz 2wyf 2wyg 2wyj 2wzf2wzm 2wzs 2x09 2x0y 2x2r 2x4z 2x6x 2x7t 2x7u 2x8z 2x91 2x952x96 2x97 2xab 2xb7 2xbp 2xbw 2xbx 2xc0 2xc4 2xd9 2xda 2xde

2xdk 2xdx 2xef 2xeg 2xei 2xej 2xg9 2xhm 2xht 2xib 2xj1 2xj22xjg 2xjj 2xjx 2xm1 2xm2 2xmy 2xn3 2xn5 2xog 2xp7 2xpk 2xxr2xxt 2xxx 2xyd 2xye 2xyf 2xyt 2y5f 2y5g 2y7i 2y7x 2y7z 2y802y81 2y82 2y8c 2ya6 2ya7 2ya8 2yay 2yaz 2yb0 2ydt 2ydw 2yek2yel 2yfa 2yfx 2ygf 2yhw 2yi0 2yi7 2yix 2yk1 2ylc 2yme 2ypi2ypo 2yxj 2yz3 2z1w 2z4o 2z94 2za0 2za5 2zc9 2zcs 2zdk 2zdl2zdm 2zdn 2zfp 2zfs 2zft 2zgx 2zjw 2zkj 2zmm 2zn7 2zq0 2zq22zwz 2zx6 2zx7 2zx8 2zxd 2zxg 2zym 2zz1 2zz2 3a1c 3a1d 3a1e3a2o 3a5y 3a6t 3a9i 3aaq 3aas 3aau 3acl 3acx 3agl 3ahn 3aho3ai8 3aid 3alt 3ao2 3ao5 3ap4 3aqt 3arw 3arx 3axz 3b24 3b253b26 3b2q 3b3c 3b3s 3b3w 3b3x 3b4f 3b4p 3b50 3b66 3b67 3b7i3b7j 3b7r 3b7u 3b92 3bbb 3bbf 3be9 3bex 3bft 3bfu 3bgb 3bgc3bgq 3bgs 3bkk 3bkl 3bl0 3bl1 3bpc 3bqc 3bra 3brn 3bu1 3buf3bug 3buh 3bva 3bvb 3bwj 3bxe 3bxf 3bxg 3bxh 3bzf 3c2f 3c2o3c2r 3c2u 3c39 3c4h 3c52 3c56 3c79 3c84 3c88 3c89 3c8a 3c8b3cct 3ccw 3ccz 3cd0 3cd5 3cd7 3cda 3cdb 3cf8 3cfn 3cft 3cj23cj5 3ckb 3cke 3ckp 3ckz 3cl0 3cm2 3cow 3cs7 3ctt 3cyw 3cyx3cz1 3czv 3d0b 3d0e 3d1x 3d1y 3d1z 3d2e 3d4y 3d50 3d51 3d523d6o 3d6p 3d78 3d7k 3d7z 3d83 3d8w 3d8z 3d91 3d9z 3da9 3daz3dbu 3dc3 3dcc 3dd8 3ddf 3ddg 3dgo 3djk 3djo 3djp 3djq 3djv3djx 3dk1 3dln 3dnd 3dne 3dp4 3dp9 3drf 3drg 3dri 3dsz 3dx33dx4 3dyo 3dzt 3e12 3e3c 3e5u 3e6y 3eax 3eb1 3ebh 3ebi 3ebl3ebo 3ed0 3eeb 3eft 3egt 3ehx 3ejp 3ejq 3eko 3ekp 3ekr 3ekt3ekv 3ekw 3ekx 3el1 3el4 3el5 3el9 3elc 3eqr 3ery 3evd 3ewc3ewj 3exe 3exh 3f15 3f16 3f17 3f18 3f19 3f1a 3f33 3f34 3f373f48 3f5j 3f5k 3f5l 3f68 3f6e 3f6g 3f70 3f78 3f7g 3f7h 3f7i3f80 3f8c 3f8e 3f8f 3fas 3fat 3fed 3fee 3ff3 3ffg 3ffp 3fh73fhb 3fj7 3fjg 3fl5 3fqe 3fql 3fuc 3fuz 3fv3 3fvh 3fvk 3fvl3fvn 3fwv 3fx6 3fzn 3fzy 3g0e 3g0i 3g19 3g1d 3g1v 3g2y 3g303g32 3g34 3g35 3g3r 3g5k 3ga5 3gba 3gbe 3gc4 3gcp 3gcs 3gcu3gdt 3ggu 3gi4 3gi5 3gi6 3gjw 3gk1 3gkz 3gm0 3gqz 3gs6 3gsm3gss 3gst 3gt9 3gta 3gtc 3gvb 3gvu 3gx0 3gy2 3gy3 3gy7 3h1x3h30 3h5b 3h78 3h89 3h8b 3hb4 3hcm 3hek 3hf8 3hfb 3hig 3hit3hk1 3hkn 3hkq 3hkt 3hku 3hkw 3hky 3hl5 3hl7 3hl8 
3hll 3hmo3hmp 3hp9 3hs4 3hu3 3hub 3huc 3hv8 3hvi 3hvj 3hww 3hzk 3hzm3hzv 3i25 3i3b 3i4b 3i4y 3i51 3i5z 3i60 3i6o 3i73 3i7e 3i9g3iae 3ibi 3ibl 3ibn 3ibu 3ies 3ifl 3igp 3ijh 3ikd 3ikg 3imc3ime 3iob 3ioc 3iod 3ioe 3iof 3iog 3ip5 3ip6 3ip8 3ip9 3iph3ipq 3ipu 3iqu 3isj 3iss 3iub 3iue 3ivc 3ivx 3iw5 3iw6 3iww3jdw 3jrs 3jrx 3juk 3juo 3jup 3jy0 3jyr 3jzh 3jzj 3k00 3k023k1j 3k2f 3k37 3k4d 3k4q 3k5x 3k8c 3k8o 3k8q 3k97 3k99 3kdb3kdc 3kdd 3kdm 3kek 3kgq 3kgt 3kgu 3kiv 3kjd 3kku 3kmc 3kmx3kmy 3kqr 3kr4 3kv2 3kyq 3l0v 3l3l 3l3m 3l3n 3l4u 3l4v 3l4w3l4x 3l4y 3l4z 3l59 3ldp 3ldq 3le9 3lea 3lgs 3lir 3liw 3ljg3ljo 3ljz 3lk8 3lmk 3lp4 3lp7 3lpi 3lpk 3lpl 3lpp 3lq2 3lvw3lxe 3lxk 3lzs 3lzu 3lzz 3m1k 3m35 3m36 3m37 3m3c 3m3x 3m3z3m40 3m5e 3m67 3m6r 3m8u 3m96 3mam 3mdz 3mf5 3mfv 3mfw 3mhc3mhi 3mhl 3mhm 3mho 3mhw 3mi2 3mi3 3miy 3mjl 3ml2 3ml5 3mmf3mna 3mof 3ms9 3muz 3mv0 3mxd 3mxe 3myq 3mz6 3mzc 3n0n 3n1c3n2p 3n2u 3n2v 3n35 3n3g 3n3j 3n4b 3n7o 3n8k 3n9r 3n9s 3nb53nee 3neo 3nes 3nex 3ng4 3nhi 3nht 3ni5 3nik 3nim 3nkk 3nox3npc 3nq3 3nsn 3nu3 3nu4 3nu5 3nu6 3nu9 3nuj 3nuo 3nw3 3nxq3nyd 3nyx 3nzk 3o4k 3o56 3o5n 3o5x 3o75 3o7u 3o84 3o8p 3o993o9a 3o9d 3o9e 3o9p 3oaf 3ocp 3ocz 3ohi 3oil 3oim 3ok9 3oku3old 3ouj 3ov1 3ove 3ovn 3owj 3own 3oy0 3oy8 3oyq 3oyw 3ozg3ozj 3ozp 3ozr 3p17 3p2e 3p3g 3p3r 3p3s 3p3t 3p4v 3p58 3p5l3p7i 3p8n 3p8o 3p8p 3p8z 3p9l 3p9m 3pb7 3pb8 3pb9 3pbb 3pce3pcf 3pcg 3pcj 3pck 3pcn 3pd8 3pd9 3pe1 3pe2 3pfp 3pgl 3pgu3pju 3pn1 3pn4 3po1 3po6 3ppm 3ppp 3ppq 3ppr 3ps1 3pwd 3pwk3pwm 3q1x 3q2j 3q44 3q6w 3q6z 3q71 3q7q 3qaa 3qbc 3qdd 3qfd3qfy 3qfz 3qgw 3qkd 3qlm 3qox 3qps 3qqa 3qt6 3qto 3qtv 3qw53qwc 3qx5 3qx9 3qxt 3qxv 3r16 3r17 3r1v 3r24 3r4m 3r4n 3r4p3r5t 3r6u 3r7o 3rbu 3rdo 3rdq 3re4 3rf4 3rf5 3rlb 3rlp 3rlq3rm4 3rm9 3roc 3rt8 3rtf 3ru1 3rux 3rv4 3rv8 3rwp 3ryv 3ryx3ryy 3ryz 3rz0 3rz1 3rz5 3rz7 3rz8 3s0b 3s0d 3s0e 3s2v 3s433s45 3s54 3s5y 3s6t 3s71 3s72 3s73 3s75 3s76 3s77 3s78 3s8l3s8n 3s8o 3s9e 3sfg 3sha 3shc 3si3 3si4 3sio 3sjf 3sk2 3slz3sm2 3spf 3sr4 3st5 3std 3str 3su0 3su1 
3su2 3su3 3su4 3su53su6 3sue 3suf 3sug 3sur 3sus 3sut 3suu 3suv 3suw 3sv2 3sw83sww 3sxf 3t01 3t08 3t09 3t0b 3t0d 3t0x 3t1a 3t1m 3t2q 3t2w3t3c 3t3u 3t5u 3t60 3t64 3t6b 3t70 3t82 3t83 3t84 3t85 3t8v3ta0 3ta1 3tao 3tay 3tb6 3tcg 3td4 3tf6 3tfn 3tfp 3tfu 3th93tif 3tk2 3tkw 3tmk 3ts4 3tt4 3ttm 3ttp 3tu7 3tvc 3tz0 3tza3tzm 3u10 3u5l 3u6h 3u6i 3u81 3u8j 3u8l 3u90 3u92 3u93 3ubd

3ucj 3udd 3ug2 3uil 3uj9 3ujc 3ujd 3umq 3uod 3upk 3upv 3usx3uu1 3uug 3uw4 3uw5 3uxd 3uxk 3uxl 3uyr 3uz5 3uzj 3v2n 3v2p3v2q 3v3q 3v4t 3v51 3v5p 3v5t 3v78 3v7x 3vbd 3vd4 3vd9 3vdb3veh 3vf5 3vf7 3vfa 3vfb 3vh9 3vha 3vhc 3vhd 3vhk 3vjc 3vje3vtr 3vvy 3vw1 3vw2 3vx3 3w07 3w37 3w5n 3w9k 3w9r 3wgg 3wha3wjw 3wmc 3wtl 3wtm 3wtn 3wto 3wvm 3wz6 3wz7 3wzn 3x00 3zbx3zc5 3zcl 3zdh 3zdv 3zhx 3zi0 3zi8 3zj6 3zk6 3zll 3zln 3zlr3zm9 3znr 3zns 3zps 3zpu 3zq9 3zqe 3zsq 3zsy 3zt3 3zv7 3zxz3zyf 3zyu 3zze 456c 4a4q 4a4v 4a4w 4a6b 4a6c 4a6l 4a6s 4a7i4a95 4ab9 4aba 4abb 4abd 4abe 4abf 4abh 4acc 4aci 4ad2 4ad34ad6 4afg 4ag8 4agc 4agl 4agm 4ago 4ahr 4ahs 4ahu 4ai5 4aia4aj4 4aje 4aji 4ajl 4alx 4aoi 4ap7 4app 4aq4 4aq6 4aqh 4ara4arb 4arw 4asd 4ase 4asj 4att 4auj 4av4 4av5 4avh 4avi 4avj4avs 4ax9 4axd 4ayp 4ayq 4ayu 4az5 4az6 4azb 4azc 4azg 4azi4b0b 4b1j 4b2i 4b2l 4b32 4b33 4b34 4b35 4b3b 4b3c 4b3d 4b5d4b5s 4b5t 4b5w 4b6o 4b6p 4b6r 4b6s 4b73 4b74 4b76 4b7j 4b7p4b7r 4b8y 4b9k 4b9z 4bah 4bak 4bam 4ban 4bao 4baq 4bb9 4bc54bck 4bcm 4bcn 4bco 4bcp 4bcs 4bf1 4bf6 4bi6 4bi7 4bj8 4bks4bny 4bps 4bqg 4bqh 4bqs 4br3 4bs0 4bt3 4bt4 4bt5 4btk 4bup4buq 4c1t 4c1u 4c1y 4c2v 4c52 4c5d 4c6u 4c9x 4ca5 4ca6 4ca74ca8 4cc5 4cd0 4cd4 4cd5 4ceb 4cfl 4cg8 4cg9 4cga 4cgi 4cj44cjp 4cjq 4cjr 4ck3 4cl6 4clj 4cmo 4cp5 4cp7 4cpr 4cps 4cpt4cpw 4cpy 4cpz 4cr5 4crb 4crf 4crl 4cs9 4csd 4css 4cst 4cu74cu8 4cwf 4cwn 4cwo 4cwp 4cwq 4cwr 4cws 4cwt 4czs 4d1j 4d3h4d4d 4d7b 4d8z 4da5 4daf 4db7 4dbm 4dcs 4ddm 4de0 4de5 4del4der 4des 4det 4deu 4dew 4dff 4dfg 4dhl 4djo 4djp 4djq 4djr4dju 4djw 4djx 4djy 4dko 4dkp 4dkq 4dkr 4dmw 4do4 4do5 4dq24dst 4dsu 4dsy 4duh 4dv8 4dy6 4dzy 4e0x 4e1k 4e3g 4e4l 4e4n4e67 4e6d 4e70 4e7r 4e9u 4eb8 4ef6 4efk 4efs 4egk 4ehz 4ei44ej8 4ejl 4ek9 4elf 4elg 4elh 4emf 4emr 4en4 4eo6 4eoh 4epy4er1 4er2 4erf 4etz 4eu0 4euo 4ew2 4ew3 4ewn 4exs 4ezr 4ezx4ezz 4f0c 4f1l 4f39 4f3k 4f5y 4f6u 4f6w 4f7v 4f9u 4f9y 4fai4fcq 4fev 4few 4ffs 4fht 4fk6 4fl1 4fl2 4flp 4fm7 4fm8 4fnn4fp1 4fs4 4fsl 4fxp 4fxq 4fys 4fz3 4fzj 4g0p 4g0q 
4g0y 4g0z4g4p 4g5f 4g8m 4g8n 4g8v 4g8y 4g90 4g95 4gah 4gbd 4ge1 4gfo4gg7 4ggz 4ghi 4gih 4gii 4gj2 4gj3 4gkh 4gki 4gny 4gq4 4gql4gqp 4gqq 4gqr 4gr3 4gr8 4gu6 4gu9 4gue 4gzp 4gzt 4gzw 4gzx4h3f 4h3g 4h3j 4h42 4h75 4h7q 4h81 4h85 4ha5 4hbm 4hdb 4hdf4hdp 4heg 4hf4 4hfp 4hj2 4hla 4hp0 4hpi 4ht0 4ht2 4hu1 4hw34hwo 4hwp 4hws 4hy1 4hym 4hzm 4i3z 4i54 4i5c 4i71 4i72 4i744i7j 4i7k 4i7l 4i7m 4i7p 4i8n 4i8w 4i8x 4i8z 4i9h 4i9u 4ibb4ibc 4ibd 4ibe 4ibf 4ibg 4ibi 4ibj 4ibk 4idn 4ido 4ieh 4igt4ih3 4ih6 4iic 4iid 4iie 4iif 4ij1 4in9 4io2 4io3 4io4 4io54io6 4io7 4ipi 4ipj 4ipn 4ish 4isi 4isu 4itp 4iue 4iuo 4iva4iwz 4j22 4j44 4j45 4j46 4j47 4j48 4j7d 4j7e 4j93 4jal 4je74je8 4jfk 4jfm 4jh0 4jkw 4jn2 4jne 4jpx 4jpy 4jsa 4jss 4jwk4jx9 4jyb 4jyc 4jym 4jyt 4jz1 4jzi 4k0o 4k0y 4k3h 4k3n 4k4j4k55 4k5p 4k6i 4k7i 4k7n 4k7o 4k9y 4kao 4kax 4kb9 4kcx 4keq4kfq 4kif 4kiu 4km0 4km2 4kmz 4kn0 4kn1 4kni 4knj 4knm 4knn4ko8 4kow 4kp5 4kp8 4kqp 4ks1 4ks4 4ksy 4kwf 4kwg 4kwo 4kx84kxb 4kxn 4kyh 4kyk 4kz3 4kz4 4kz7 4l19 4l2l 4l4v 4l4z 4l504l51 4l6t 4l9i 4lar 4lbu 4lch 4leq 4lhm 4lhv 4lj5 4lj8 4ljh4lk7 4lkk 4lko 4lkq 4ll3 4llj 4llk 4llp 4lm0 4lm1 4lm2 4lm34lm4 4loh 4loi 4loo 4lov 4loy 4lps 4lrr 4luz 4lvt 4lxd 4lxz4ly1 4ly9 4lyw 4lzr 4m0e 4m0f 4m0r 4m12 4m13 4m14 4m2r 4m2u4m2v 4m2w 4m3p 4m6u 4m7j 4m8e 4m8h 4m8x 4m8y 4mc1 4mc2 4mc64mc9 4mdn 4mhy 4mhz 4mjp 4mmm 4mmp 4mn3 4mnp 4mo4 4mo8 4mpn4mq6 4mr3 4mr6 4mre 4mrg 4msa 4msc 4mss 4muf 4mul 4muv 4myd4n07 4n5d 4n6g 4n6z 4n7m 4n7u 4n8q 4n9a 4n9c 4na9 4nbk 4nbl4nbn 4ncn 4ndu 4ngm 4ngn 4ngp 4nh7 4nh8 4nj9 4nja 4nkt 4nku4nl1 4nnr 4non 4np2 4np3 4np9 4nra 4nuc 4nue 4nvp 4nwc 4nxu4nxv 4nyf 4o04 4o05 4o07 4o09 4o0a 4o0b 4o0x 4o0y 4o2b 4o2c4o2p 4o3c 4o3f 4o61 4o6w 4o97 4o9v 4o9w 4oag 4oak 4oc0 4oc14oc2 4oc3 4oc5 4ocq 4oct 4oeu 4og3 4og4 4oiv 4oks 4oma 4omc4omj 4omk 4or4 4or6 4ou3 4ovf 4ovg 4ovh 4owv 4ozj 4p3h 4p584p5d 4p5z 4p6c 4p6w 4p6x 4pb2 4pee 4pf5 4pft 4pfu 4pg9 4phu4pin 4pmm 4pnu 4poh 4poj 4pop 4pow 4pox 4pp0 4pp3 4pp5 4pqa4psb 4pum 4pv5 4pvx 4pvy 4pzv 4q08 4q09 
4q0k 4q19 4q1w 4q1x4q1y 4q3t 4q3u 4q46 4q4o 4q4p 4q4q 4q4r 4q4s 4q6d 4q6e 4q7p4q7s 4q7v 4q7w 4q81 4q83 4q87 4q8x 4q8y 4q90 4q93 4q99 4q9o4q9y 4qb3 4qdk 4qem 4qer 4qev 4qew 4qf7 4qf8 4qf9 4qfl 4qfn4qfo 4qfp 4qgd 4qge 4qgi 4qij 4qj0 4qjw 4qjx 4ql1 4qlk 4qll4qnb 4qp2 4qpd 4qpl 4qrh 4qsu 4qsv 4qtl 4qxo 4qy3 4qyy 4r064r0a 4r3w 4r4c 4r4i 4r4o 4r4t 4r59 4r5a 4r5b 4r5t 4r73 4r74

4r75 4r76 4ra1 4rak 4rd0 4rd3 4rd6 4rdn 4re2 4re4 4rfc 4rfd
4rfr 4rhx 4riu 4riv 4rj8 4rlt 4rlu 4rlw 4rn4 4rpn 4rpo 4rqk
4rqv 4rr6 4rra 4rrf 4rrg 4rsk 4rux 4ruy 4ruz 4rvr 4rwj 4rww
4ryd 4s1g 4sga 4std 4tim 4tjz 4tkb 4tkh 4tkj 4tln 4tmk 4tpw
4tqn 4trc 4ts1 4tt2 4tte 4tu4 4tun 4ty6 4tz2 4u0f 4u0w 4u1b
4u43 4u54 4u5n 4u5o 4u5s 4u69 4u6c 4u6w 4u6z 4u70 4u71 4u73
4u8w 4ua8 4uac 4ual 4uc5 4ucc 4ufh 4ufi 4ufj 4ufk 4ufl 4ufm
4uin 4uj1 4uj2 4uja 4ujb 4uma 4umb 4umc 4und 4unp 4uof 4uoh
4up5 4ury 4urz 4us3 4uye 4uyf 4v01 4v24 4v27 4w52 4w97 4w9d
4w9f 4w9j 4w9k 4w9o 4w9p 4wa9 4whs 4wk1 4wkb 4wkn 4wko 4wkp
4wn5 4wop 4wov 4wrb 4wt2 4x24 4x3k 4x48 4x50 4x5p 4x5q 4x5r
4x5y 4x5z 4x6m 4x6n 4x6o 4x8o 4x8u 4x8v 4xaq 4xar 4xas 4xip
4xiq 4xir 4xit 4xk9 4xmb 4xmr 4xo8 4xoc 4xoe 4xt2 4xtv 4xtw
4xtx 4xty 4xtz 4xu0 4xu1 4xu2 4xu3 4xxh 4xy8 4xya 4y0a 4y2q
4y3j 4y3y 4y4j 4y59 4y5d 4y79 4y8x 4ybk 4yc0 4yes 4ygf 4yha
4yhm 4yho 4yk0 4ykj 4ykk 4ymb 4ymg 4ymh 4yml 4ymq 4ymx 4ynb
4ynl 4yo8 4yrd 4ysl 4ytc 4yth 4yx4 4yxi 4yyt 4yzu 4z07 4z0k
4z0q 4z1e 4z1j 4z1k 4z2b 4z83 4z84 4z93 4zae 4zb6 4zb8 4zba
4zbf 4zbi 4zeb 4zec 4zei 4zek 4zgk 4zip 4zji 4zl4 4zls 4zme
4zo5 4zow 4zt8 4zv1 4zv2 4zvi 4zw5 4zw6 4zw7 4zw8 4zwx 4zwz
4zx0 4zx1 4zx3 4zx4 4zyf 4zzd 4zzx 4zzy 4zzz 5a2i 5a5q 5a6k
5a6x 5a7y 5a81 5aa9 5aan 5acy 5ad1 5afv 5ahw 5alb 5am6 5am7
5amd 5amg 5aml 5ant 5anu 5anv 5aoi 5aoj 5aol 5aqz 5aut 5ave
5avf 5ayt 5azf 5b25 5b2d 5b5f 5b5g 5boj 5bry 5bs4 5btv 5btx
5bv3 5bw4 5bwc 5byi 5c1m 5c2a 5c2o 5c3p 5c5t 5c8n 5cap 5caq
5cas 5cau 5cbm 5cbr 5cbs 5cc2 5cep 5ceq 5chk 5cj6 5cjf 5cks
5cp5 5cp9 5cqt 5cqu 5cs3 5cs6 5cso 5csp 5cst 5ct2 5cu4 5cxa
5cy9 5czm 5d0c 5d0r 5d1r 5d21 5d24 5d25 5d26 5d2r 5d3c 5d3h
5d3j 5d3l 5d3n 5d3p 5d3t 5d3x 5d45 5d47 5d48 5d6j 5dbm 5dex
5dey 5dfp 5dgu 5dgw 5dh4 5dh5 5dhu 5dit 5dkn 5dlx 5dnu 5dpx
5dq8 5dqc 5dqe 5dqf 5drr 5dus 5duw 5dw2 5dx4 5dxt 5dyo 5e13
5e1s 5e28 5e2k 5e2l 5e2o 5e2p 5e2r 5e2s 5e3a 5e6o 5e73 5e74
5e7n 5e89 5e8f 5ect 5edb 5edc 5edd 5edl 5eei 5eek 5een 5ef7
5efa 5efc 5efh 5efj 5egm 5egu 5eh5 5eh7 5eh8 5ehq 5ehr 5ehv
5ehw 5ei3 5eis 5ekm 5el9 5elw 5en3 5epl 5epn 5eq1 5eqe 5eqp
5eqy 5er1 5er2 5er4 5etb 5etj 5eu1 5ev8 5evb 5evd 5evk 5evz
5ew0 5ewa 5ewk 5ewy 5exl 5exm 5exn 5exw 5ey0 5ey4 5eyr 5f08
5f0f 5f1h 5f1r 5f1v 5f1x 5f25 5f2p 5f2r 5f2u 5f5z 5f60 5f61
5f62 5f63 5f74 5f8y 5f9b 5fbi 5fck 5fcz 5fdc 5fdi 5fe6 5fe7
5fe9 5fh7 5fh8 5fhm 5fhn 5fho 5fl4 5fl5 5fl6 5flo 5flq 5fls
5flt 5fnc 5fnd 5fnf 5fng 5fnr 5fns 5fnt 5fnu 5fog 5fol 5fot
5fou 5fov 5fox 5fpk 5fs5 5fsn 5fso 5fsx 5fsy 5ftg 5fto 5fut
5fwr 5fyx 5g17 5g1a 5g1z 5g2g 5g45 5g46 5g4m 5g4n 5g4o 5g5f
5g5z 5g60 5g61 5gj9 5gja 5gmh 5gmn 5gof 5gs9 5gsa 5h1t 5h1u
5h1v 5h5f 5h8e 5h8g 5h9r 5ha1 5hbn 5hbs 5hct 5hcv 5hcy 5hi7
5his 5hjq 5hrv 5hrw 5hrx 5htl 5htz 5hu9 5hva 5hvs 5hvt 5hwu
5hwv 5hz5 5hz6 5hz8 5hz9 5i1q 5i29 5i2e 5i2f 5i3a 5i3v 5i3w
5i3x 5i3y 5i7x 5i7y 5i80 5i88 5i8g 5i9x 5i9y 5i9z 5ia0 5ia1
5ia2 5ia3 5ia4 5ia5 5ie1 5igm 5ih9 5ihh 5ii2 5ikb 5ime 5ioz
5ipc 5ipj 5irr 5isz 5ito 5itp 5ivc 5ive 5ivv 5ivy 5iwg 5ix0
5izf 5izj 5j0d 5j1r 5j3l 5j41 5j6a 5j7q 5j7w 5j8z 5ja0 5jfp
5jfu 5jg1 5jhb 5ji8 5jop 5jox 5jq5 5js3 5jsg 5jsj 5jsq 5jss
5jt9 5jvi 5jxn 5jxq 5jy3 5jzi 5k03 5k0h 5k1d 5k1f 5k8s 5k9w
5ka1 5ka7 5ka9 5kab 5kad 5kat 5kax 5kbe 5kby 5kcb 5kej 5khm
5kly 5km9 5kma 5ko1 5kqx 5kqy 5kr0 5kr1 5kr2 5kva 5kz0 5l2s
5l30 5l3a 5l4i 5l4j 5l4m 5l7e 5l7g 5l7h 5l8a 5l9g 5l9i 5l9l
5l9o 5ld8 5ldm 5ldp 5lif 5ljq 5ljt 5lli 5lne 5lom 5lsg 5lsh
5lso 5lud 5lvd 5lvl 5lvq 5lvr 5lwd 5lwm 5lyn 5lyr 5lz4 5lz5
5lz7 5m04 5m17 5m23 5m25 5m28 5m4q 5m5d 5m77 5m7s 5m7u 5m9w
5ma7 5meh 5mek 5mes 5mg2 5mge 5mgf 5mgj 5mgk 5mjn 5mkr 5mks
5mn1 5mnr 5mo8 5mod 5mpz 5mqe 5mrb 5mrm 5mro 5mrp 5mxf 5my8
5mz8 5n0d 5n0e 5n0f 5n17 5n18 5n1r 5n1s 5n1z 5n24 5n25 5n2t
5n2z 5n31 5n34 5n3v 5n3y 5n6s 5n84 5n93 5n99 5n9r 5nbw 5ndf
5ne5 5nea 5neb 5nee 5ngz 5nih 5njz 5nk2 5nk3 5nk4 5nk6 5nk7
5nk8 5nk9 5nka 5nkb 5nkc 5nkd 5nkg 5nkh 5nki 5nn5 5nn6 5nvv
5nvw 5nvx 5nw0 5nw1 5nw2 5nwi 5o07 5o2d 5o4f 5o58 5oei 5oku
5oot 5op4 5op5 5oq8 5org 5orv 5orw 5os2 5os4 5os5 5ose 5ot8
5ot9 5ota 5otc 5ouh 5ovr 5ovx 5std 5sxm 5sym 5sz0 5sz1 5sz2
5sz3 5sz4 5sz5 5sz6 5sz7 5t19 5t7s 5t8o 5t8p 5t9u 5t9w 5t9z
5ta2 5ta4 5tb6 5tbe 5tbm 5tcj 5tcy 5tef 5tfx 5th4 5ti0 5tmp
5tp0 5tpx 5tt3 5ttw 5tuo 5tuz 5twj 5txy 5ty9 5tya 5u0d 5u0e
5u0f 5u0g 5u0w 5u0y 5u0z 5u11 5u12 5u13 5u14 5u28 5u49 5u4b
5u4d 5u6j 5u8c 5ueu 5uez 5uf0 5ufc 5uff 5ufp 5ufr 5ufs 5uk8
5ula 5uln 5ulp 5ult 5uoo 5uov 5upe 5upf 5upj 5upz 5ut6 5uv2
5uxf 5v0n 5v79 5v7a 5v82 5vb5 5vb6 5vb7 5vc3 5vc4 5vcv 5vcw
5vcy 5vcz 5vd0 5vd1 5vd2 5vd3 5vgy 5vi6 5vih 5vij 5vkc 5vo1
5voj 5vp9 5vsf 5vsj 5w1e 5wa8 5wa9 5wal 5wbm 5wbo 5wcm 5we9
5wex 5wl0 5wlo 5wp5 5wqc 5wuk 5wxh 5wyx 5wyz 5x54 5x62 5x74
5xg5 5yas 5yjm 6ayi 6b4l 6b4n 6b4u 6b7a 6b7b 6b96 6b97 6b98
6cpa 6en5 6ep4 6eqp 6equ 6euw 6eux 6ezq 6rnt 6std 6upj 7std
7upj 8a3h 8cpa 966c

A.2 Validation data: PDBBind 2018 Core set

1a30 1bcu 1bzc 1e66 1eby 1g2k 1gpk 1gpn 1h22 1h23 1k1i 1lpg
1mq6 1nc1 1nc3 1nvq 1o0h 1o3f 1owh 1oyt 1p1n 1p1q 1ps3 1pxn
1q8t 1q8u 1qf1 1qkt 1r5y 1s38 1sqa 1syi 1u1b 1uto 1vso 1w4o
1y6r 1yc1 1ydr 1ydt 1z6e 1z95 1z9g 2al5 2br1 2brb 2c3i 2cbv
2cet 2fvd 2fxs 2hb1 2iwx 2j78 2j7h 2p15 2p4y 2pog 2qbp 2qbq
2qbr 2qe4 2qnq 2r9w 2v00 2v7a 2vkm 2vvn 2vw5 2w4x 2w66 2wbg
2wca 2weg 2wer 2wn9 2wnc 2wtv 2wvt 2x00 2xb8 2xbv 2xdl 2xii
2xj7 2xnb 2xys 2y5h 2yfe 2yge 2yki 2ymd 2zb1 2zcq 2zcr 2zda
2zy1 3acw 3ag9 3ao4 3arp 3arq 3b1m 3b27 3b5r 3b65 3b68 3bgz
3bv9 3cj4 3coy 3coz 3cyz 3d4z 3dd0 3dx1 3dx2 3e5a 3e92 3e93
3ebp 3ehy 3ejr 3f3c 3f3d 3f3e 3fcq 3fur 3fv1 3fv2 3g0w 3g2z
3g31 3gbb 3gc5 3ge7 3gnw 3gr2 3gv9 3gy4 3ivg 3jvr 3jvs 3jya
3k5v 3kgp 3kr8 3kwa 3lka 3mss 3myg 3n76 3n7a 3n86 3nq9 3nx7
3o9i 3oe4 3oe5 3ozs 3ozt 3p5o 3prs 3pww 3pyy 3qgy 3qqs 3r88
3rlr 3rr4 3rsx 3ryj 3tsk 3twp 3u5j 3u8k 3u8n 3u9q 3udh 3ueu
3uev 3uew 3uex 3ui7 3uo4 3up2 3uri 3utu 3uuo 3wtj 3wz8 3zdg
3zso 3zsx 3zt2 4abg 4agn 4agp 4agq 4bkt 4cig 4ciw 4cr9 4cra
4crc 4ddh 4ddk 4de1 4de2 4djv 4dld 4e5w 4e6q 4ea2 4eo8 4eor
4f09 4f2w 4f3c 4f9w 4gfm 4gid 4gkm 4gr0 4hge 4ih5 4ih7 4ivb
4ivc 4ivd 4j21 4j28 4j3l 4jfs 4jia 4jsz 4jxs 4k18 4k77 4kz6
4kzq 4kzu 4llx 4lzs 4m0y 4mgd 4mme 4mrw 4mrz 4msn 4ogj 4owm
4pcs 4qac 4qd6 4rfm 4tmn 4twp 4ty7 4w9c 4w9h 4w9i 4w9l 4wiv
5a7b 5aba 5c1w 5c28 5c2h 5dwr 5tmn

A.3 Test set: CSAR HiQ

A.3.1 Set 1

1ax1 1swk 1vot 2add 2are 2b3f 2c1q 2cjp 2hr6 2ihk 2j4k 2jdy
2jgb 2nnq 2ou0 2p98 2pzv 2q6m 2qmj 2r3d 2rca 2v7t 2v7v 2v8y
2z4b 2zlz 3d2r 3f4j 1iup 1ukb 1w6o 2arb 2b1q 2bbf 2cem 2d2v
2idz 2ilz 2jbj 2jff 2jj3 2otz 2p3t 2pjo 2q3c 2qeh 2qvu 2r6w
2rde 2v7u 2v8q 2vhw 2z8f 3c7i 3ene

A.3.2 Set 2

1ax2 1bcj 1gi9 1gx0 1gzc 1lhw 1q0y 1q6e 1rdi 1s50 1s9t 1tt1
1uld 1uzv 1x9d 1xw6 1y93 1z3t 2a3c 2dm5 2fai 2fv5 2hdq 2hj4
4ubp 1b6m 1bky 1gww 1gz9 1ha2 1ow4 1q4w 1q6g 1rdn 1s7y 1tr7
1txf 1urg 1w2g 1xl5 1y1m 1yxd 1zhx 2cji 2f5t 2ff1 2hd6 2hdr
2i0d

A.4 Test set: CSAR 2014

fxa006 fxa102 fxa401 fxa422 syk222 syk224 syk226 syk249 trmd444 trmd446 trmd448 trmd450
trmd452 trmd454 trmd456 trmd458 trmd460 trmd462 trmd464 trmd466 trmd468 trmd470 trmd472 trmd474
fxa101 fxa398 fxa406 fxa441 syk223 syk225 syk233 syk250 trmd445 trmd447 trmd449 trmd451
trmd453 trmd455 trmd457 trmd459 trmd461 trmd463 trmd465 trmd467 trmd469 trmd471 trmd473


Appendix B

Core set clusters

ID  PDB Codes                       ID  PDB Codes

 0  3ao4, 3zso, 3zsx, 3zt2, 4cig    30  1p1n, 1p1q, 1syi, 2al5
 1  1a30, 1eby, 1g2k, 2qnq, 3o9i    31  1bcu, 1oyt, 2zda, 3bv9, 3utu
 2  3oe4, 3oe5, 3ozs, 3ozt          32  3u8k, 3u8n, 3wtj, 3zdg, 4qac
 3  2zb1, 3e92, 3e93, 4f9w          33  1e66, 1gpk, 1gpn, 1h22, 1h23
 4  1bzc, 2hb1, 2qbp, 2qbq, 2qbr    34  1o0h, 1u1b, 1w4o
 5  2weg, 3dd0, 3kwa, 3ryj, 4jsz    35  2zcq, 2zcr, 2zy1, 3acw, 4ea2
 6  2vvn, 2w4x, 2w66, 2wca, 2xj7    36  1vso, 3fv1, 3fv2, 3gbb, 4dld
 7  2wvt, 2xii, 4j28, 4jfs, 4pcs    37  3nq9, 3ueu, 3uev, 3uew, 3uex
 8  1z95, 3b5r, 3b65, 3b68, 3g0w    38  3kr8, 4j21, 4j3l
 9  2c3i, 3bgz, 3jya, 4k18, 5dwr    39  4e5w, 4ivb, 4ivc, 4ivd, 4k77
10  1k1i, 1o3f, 1uto, 3gy4, 4abg    40  3f3a, 3f3c, 3f3d, 3f3e, 4mme
11  2vkm, 3rsx, 3udh, 4djv, 4gid    41  2v7a, 3k5v, 3mss, 3pyy, 4twp
12  3arp, 3arq                      42  1qf1, 1z9g, 3fcq, 4tmn, 5tmn
13  3ebp                            43  3cyz
14  2v00, 3prs, 3pww, 3uri, 3wz8    44  2cbv, 2cet, 2j78, 2j7h, 2wbg
15  1qkt, 2p15, 2pog, 2qe4, 4mgd    45  2wtv, 3e5a, 3myg, 3uo4, 3up2
16  1r5y, 1s38, 3gc5, 3ge7, 3rr4    46  3ehy, 3lka, 3nx7, 3tsk, 4gr0
17  4agn, 4agp, 4agq, 5a7b, 5aba    47  3g2z, 3g31, 4de1, 4de2
18  1ps3, 3d4z, 3dx1, 3dx2, 3ejr    48  3ui7, 3uuo, 4llx, 4mrw, 4mrz, 4msn, 5c1w, 5c28, 5c2h
19  1pxn, 2fvd, 2xnb, 3pxf, 4eor    49  4bkt, 4w9c, 4w9h, 4w9i, 4w9l
20  1lpg, 1mq6, 1z6e, 2xbv, 2y5h    50  1yc1, 2xdl, 2yki, 3b27, 3rlr
21  3coy, 3coz, 3ivg, 4ddh, 4ddk    51  4e6q, 4f09, 4gfm, 4hge, 4jia
22  2wn9, 2wnc, 2x00, 2xys, 2ymd    52  4cr9, 4cra, 4crc, 4ty7
23  1nc1, 1nc3, 1y6r, 4f2w, 4f3c    53  3cj4, 3gnw, 4eo8, 4ih5, 4ih7
24  2fxs, 2iwx, 2vw5, 2wer, 2yge    54  2p4y, 2yfe, 3b1m, 3fur, 3u9q
25  3qgy, 4m0y, 4qd6, 4rfm          55  4kzq, 4kzu
26  1owh, 1sqa, 3kgp                56  3p5o, 3u5j, 4lzs, 4ogj, 4wiv
27  3qqs, 3r88, 3twp, 4gkm, 4owm    57  1q8t, 1q8u, 1ydr, 1ydt, 3ag9
28  1nvq, 2br1, 2brb, 3jvr, 3jvs    58  2xb8, 3n76, 3n7a, 3n86, 4ciw
29  2r9w, 3gr2, 3gv9, 4jxs, 4kz6


Appendix C

Pipeline scripts

C.1 Rosetta relaxation

Listing C.1: thesis/appendix/make_ligand_pdb_params.sh

    #!/bin/bash
    # Input argument: A Mol2 file
    mol=$(basename "$1")
    filename=$(basename "$1" .mol2)
    # Generate the Rosetta .params file only if it does not exist yet.
    if [ ! -f ${filename}.params ]; then
        python2 $ROSETTA/main/source/scripts/python/public/molfile_to_params.py \
            -n WER -p $filename --conformers-in-one-file --keep-names --clobber $mol
    fi

Listing C.2: thesis/appendix/make_complex_pdb.py

    #!/cluster/apps/python/3.6.0/x86_64/bin/python3
    from prody import parsePDB, writePDB
    from multiprocessing import Pool, cpu_count
    from pathlib import Path
    import numpy as np
    import sys
    import string
    from htmd.molecule.molecule import Molecule
    from htmd.molecule.voxeldescriptors import getVoxelDescriptors
    from htmd.builder.preparation import proteinPrepare

    # Non-standard residue names and the names they are replaced with.
    RESIDUE_RENAMES = {
        'HOH': 'WAT', 'CYX': 'CYS', 'CYM': 'CYS', 'HIE': 'HIS',
        'HID': 'HIS', 'HSD': 'HIS', 'HIP': 'HIS', 'TRQ': 'TRP',
        'KCX': 'LYS', 'LLP': 'LYS', 'ARN': 'ARG', 'ASH': 'ASP',
        'GLH': 'GLU', 'LYN': 'LYS', 'AR0': 'ARG', 'HSE': 'SER',
    }


    def rename_residues(res):
        # Apply the renaming table in place to an array of residue names.
        for old, new in RESIDUE_RENAMES.items():
            res[res == old] = new
        return res


    def get_files(folder):
        for pdb in folder.glob('*/'):
            protein_pdb = pdb / f'{pdb.stem}_protein.pdb'
            ligand_pdb = pdb / f'{pdb.stem}_ligand.pdb'
            complex_pdb = pdb / f'{pdb.stem}_complex.pdb'
            if complex_pdb.exists():
                continue
            yield (protein_pdb, ligand_pdb, complex_pdb)


    def make_complex_pdb(protein_pdb, ligand_pdb, complex_pdb):
        if complex_pdb.exists():
            print(complex_pdb, "exists")
            return
        protein = parsePDB(str(protein_pdb))
        ligand = parsePDB(str(ligand_pdb))
        p_chains = protein.getChids()
        individual_chains = set(p_chains)
        # Chain IDs W, X and Z are reserved for water, ligand and metals.
        possible_chains = (
            string.ascii_uppercase.translate(str.maketrans("WXZ", "wxz"))
            + string.digits
            + string.ascii_lowercase.translate(str.maketrans({"w": "", "x": "", "z": ""}))
        )
        chain_dict = dict(zip(individual_chains, possible_chains[:len(individual_chains)]))
        protein.setChids(np.vectorize(chain_dict.get)(p_chains))
        ligand.setResnames(np.array(['WER'] * ligand.numAtoms()))
        ligand.setChids(np.array(['X'] * ligand.numAtoms()))
        complex = protein + ligand
        res = rename_residues(complex.getResnames())
        chain = complex.getChids()
        chain[res == "WAT"] = "W"
        for metal in ["MN", "MG", "ZN", "CA", "NA"]:
            chain[res == metal] = "Z"
        complex.setResnames(res)
        complex.setChids(chain)
        writePDB(str(complex_pdb), complex)

        # Keep the ligand, waters bridging ligand and protein, and nearby metals.
        complex = Molecule(str(complex_pdb))
        prot = complex.copy()
        prot.filter("protein")
        lig = complex.copy()
        lig.filter(
            "not protein and same residue as ("
            "(resname WAT and within 3 of resname WER and within 3 of protein) "
            "or (resname MN MG ZN CA NA and within 5 of resname WER) "
            "or resname WER)"
        )
        prot = proteinPrepare(prot, pH=7.0)
        mol = Molecule(name="complex")
        mol.append(prot)
        mol.append(lig)
        mol.write(str(complex_pdb))

        # proteinPrepare may reintroduce protonation-specific residue names.
        complex = parsePDB(str(complex_pdb))
        res = rename_residues(complex.getResnames())
        complex.setResnames(res)
        writePDB(str(complex_pdb), complex)


    def process(args):
        try:
            make_complex_pdb(*args)
        except Exception as e:
            print(e)
            return False
        return True


    def params(root, pdb):
        protein_pdb = root / pdb / (pdb + '_protein.pdb')
        ligand_pdb = root / pdb / (pdb + '_ligand.pdb')
        complex_pdb = root / pdb / (pdb + '_complex.pdb')
        return protein_pdb, ligand_pdb, complex_pdb


    if __name__ == "__main__":
        parent_folder = # Set root folder for complex folders
        p = Pool(cpu_count())
        p.map(process, get_files(parent_folder))

Listing C.3: thesis/appendix/minimize_rosetta.sh

    #!/bin/bash
    root= # Set root path for complex folders
    rosetta=$ROSETTA/main/source/bin/rosetta_scripts.static.linuxgccrelease
    rosettadb=$ROSETTA/main/database/
    scriptfolder=$(pwd)
    export rosetta
    export rosettadb
    export scriptfolder

    function make_rosetta() {
        folder=$(dirname "$1")
        cd $(dirname "$1")
        if [[ $? -ne 0 ]]; then
            exit 1
        fi
        # The "processed" marker file prevents a folder from being relaxed twice.
        if [ ! -f "processed" ]; then
            touch "processed"
            echo "################################################ $(basename $folder)"
            # Fill $complex and $params in the flags template for this complex.
            complex=$(basename "$1") name=$(basename "$1" .pdb) \
                params="${name/complex/ligand}.params" \
                envsubst < $scriptfolder/flags_relax.txt > "flags_relax.txt"
            echo $folder
            python3 $scriptfolder/get_closest_ligatom.py ./$(basename "$1") ./constraints
            ($rosetta @flags_relax.txt -parser:protocol "$scriptfolder/relax.xml" \
                -database $rosettadb) && echo "Finished $(basename $folder) correctly"
        fi
    }
    export -f make_rosetta

    find $root -name "*_complex.pdb" | shuf | parallel -j 24 -n 1 --ungroup bash -c ": && make_rosetta {}"

Listing C.4: thesis/appendix/make_protein_pdb.py

    from prody import parsePDB, writePDB
    from multiprocessing import Pool
    from pathlib import Path
    import numpy as np
    import sys


    def get_files(folder):
        # Yield every relaxed complex that has no protein-only PDB yet.
        for complex_pdb in folder.glob('*/*_complex_*.pdb'):
            protein_pdb = complex_pdb.parent / complex_pdb.name.replace('complex', 'protein')
            if not protein_pdb.exists():
                yield complex_pdb


    def make_complex_pdb(complex_pdb):
        # Write the complex without the ligand (residue name WER).
        complex = parsePDB(str(complex_pdb))
        protein_pdb = complex_pdb.parent / complex_pdb.name.replace('complex', 'protein')
        protein = complex.select('not resname WER')
        writePDB(str(protein_pdb), protein)


    def process(args):
        print(args)
        try:
            make_complex_pdb(args)
        except Exception as e:
            print(e)
            return False
        return True


    if __name__ == "__main__":
        root = # Set root path for complex folders
        p = Pool(48)
        p.map(process, get_files(root))

Listing C.5: thesis/appendix/make_ligand_mol2_renamed.py

    import pdb
    from pathlib import Path
    from multiprocessing import Pool


    def read_pdb(pdb_path):
        # Map atom serial number -> atom name for every HETATM record.
        with pdb_path.open('r') as f:
            lines = f.readlines()
        hetatm = filter(lambda x: x.startswith('HETATM'), lines)
        atom_num_name = map(lambda x: x.split()[1:3], hetatm)
        return dict(atom_num_name)


    def read_mol2(mol2_path, name_map):
        # Rename the atoms in the @<TRIPOS>ATOM section using name_map.
        with mol2_path.open('r') as f:
            lines = f.readlines()
        mode = "search"
        for i, line in enumerate(lines):
            if mode == "search":
                if line.startswith('@<TRIPOS>ATOM'):
                    mode = "rename"
            elif mode == "rename":
                if line.startswith('@<TRIPOS>BOND'):
                    mode = "end"
                else:
                    atom_num, atom_name = line.split()[0:2]
                    new_name = name_map[atom_num].ljust(len(atom_name))
                    position = line.find(atom_name)
                    end = position + len(new_name)
                    new_line = line[0:position] + new_name + line[end:]
                    lines[i] = new_line
            elif mode == "end":
                return lines


    def process_file(path):
        pdb_code = path.stem
        new_mol2_path = path / f"{pdb_code}_ligand_renamed.mol2"
        try:
            name_map = read_pdb(path / f"{pdb_code}_ligand.pdb")
            with new_mol2_path.open('w') as f:
                f.write("".join(read_mol2(path / f"{pdb_code}_ligand.mol2", name_map)))
        except Exception as e:
            print(e)


    def get_files(path):
        return path.glob('*/')


    if __name__ == "__main__":
        p = Pool(48)
        root = # Set root path for complex folders
        p.map(process_file, get_files(root))
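The renaming step of Listing C.5 can be tried on in-memory data without any files. The following is a minimal sketch, not the thesis code: the function names `hetatm_name_map` and `rename_mol2_atoms` and the two-atom ligand are hypothetical, and, unlike the listing, the new name simply replaces the old one, so a longer name lengthens the line instead of overwriting the following columns.

```python
def hetatm_name_map(pdb_lines):
    """Map atom serial number -> atom name for HETATM records (as strings)."""
    records = (line.split()[1:3] for line in pdb_lines if line.startswith("HETATM"))
    return {num: name for num, name in records}


def rename_mol2_atoms(mol2_lines, name_map):
    """Return a copy of the Mol2 lines with ATOM-section names taken from name_map."""
    out = list(mol2_lines)
    in_atoms = False
    for i, line in enumerate(out):
        if line.startswith("@<TRIPOS>ATOM"):
            in_atoms = True
        elif line.startswith("@<TRIPOS>"):
            in_atoms = False  # any other section header ends the ATOM block
        elif in_atoms and line.strip():
            atom_num, atom_name = line.split()[0:2]
            new_name = name_map[atom_num].ljust(len(atom_name))
            # Search after the serial number so the name is never matched inside it.
            start = line.index(atom_name, line.index(atom_num) + len(atom_num))
            out[i] = line[:start] + new_name + line[start + len(atom_name):]
    return out


# Hypothetical two-atom ligand, used only for illustration.
pdb_lines = [
    "HETATM    1  C1  WER X   1       1.000   2.000   3.000\n",
    "HETATM    2  O1  WER X   1       4.000   5.000   6.000\n",
]
mol2_lines = [
    "@<TRIPOS>MOLECULE\n",
    "lig\n",
    "@<TRIPOS>ATOM\n",
    "      1 C          1.0000    2.0000    3.0000 C.3\n",
    "      2 O          4.0000    5.0000    6.0000 O.3\n",
    "@<TRIPOS>BOND\n",
    "     1     1     2    1\n",
]

name_map = hetatm_name_map(pdb_lines)
renamed = rename_mol2_atoms(mol2_lines, name_map)
print("".join(renamed))
```

Only the lines between `@<TRIPOS>ATOM` and the next section header are touched, so the bond table keeps its numeric atom references.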

Listing C.6: thesis/appendix/make_ligand_mol2.sh

    #!/bin/bash
    root= # Set root path for complex folders
    script=$ROSETTA/main/source/src/apps/public/ligand_docking/pdb_to_molfile.py
    export script

    function make_ligand_mol2() {
        dir=$(dirname $1)
        complex=$1
        name=$(basename "$1" .pdb)
        # The PDB code is the part of the file name before the first underscore.
        code=$(echo $name | cut -f 1 -d "_")
        mol="$dir/${code}_ligand_renamed.mol2"
        out=$dir/${name/complex/ligand}.mol2
        python2 $script $mol $complex > $dir/${name/complex/ligand}.mol2
    }
    export -f make_ligand_mol2

    find ${root} -name "*_complex_*.pdb" -maxdepth 2 | parallel -n 1 bash -c ": && make_ligand_mol2 {}"

Listing C.7: thesis/appendix/flags_relax.txt

    -in
     -file
      -s ./$complex
      -extra_res_fa ./$params
    -packing
     -ex1
     -ex1aro
     -ex2
    -mute core.util.prof ## dont show timing info
    -mute core.io.database
    #-mute all
    #-unmute protocols.jd2.JobDistributor
    -constraints:cst_fa_file ./constraints
    -score:set_weights atom_pair_constraint 1.0
    -in:auto_setup_metals
    -in:metals_angle_constraint_multiplier 3.0
    -in:metals_distance_constraint_multiplier 3.0
    -in:ignore_waters false
    -ignore_zero_occupancy false
    -keep_input_protonation_state true
    -nstruct 10
    -overwrite
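The `$complex` and `$params` placeholders in this flags file are filled per complex by `envsubst` in Listing C.3. As an illustration only, the same substitution can be sketched in Python with `string.Template`; the helper name `fill_flags` and the file name `1abc_complex.pdb` are hypothetical, and only the first four lines of the template are reproduced.

```python
from string import Template

# First lines of the flags template; the remaining flags contain no placeholders.
flags_template = (
    "-in\n"
    " -file\n"
    "  -s ./$complex\n"
    "  -extra_res_fa ./$params\n"
)


def fill_flags(template_text, complex_name):
    """Fill $complex and $params the way the shell script derives them."""
    # name=$(basename "$1" .pdb)
    name = complex_name[:-len(".pdb")] if complex_name.endswith(".pdb") else complex_name
    # params="${name/complex/ligand}.params" replaces the first match only
    params = name.replace("complex", "ligand", 1) + ".params"
    return Template(template_text).substitute(complex=complex_name, params=params)


filled = fill_flags(flags_template, "1abc_complex.pdb")
print(filled)
```

One difference worth keeping in mind: `envsubst` silently replaces unset variables with an empty string, whereas `Template.substitute` raises `KeyError`, which makes a missing value easier to notice.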

Listing C.8: thesis/appendix/get_constraint.py

    from prody import *
    import numpy as np
    import sys


    def process(file, output):
        complex = parsePDB(file)
        # Metal ions were assigned to chain Z when the complex was built.
        metals = complex.select("chain Z")
        # import pdb; pdb.set_trace()
        result = []
        if metals:
            for atom in metals:
                pos = atom.getCoords()
                # Ligand heteroatoms (neither H nor C) within 2.5 A of the metal.
                close_ligand = complex.select(
                    "resname WER and (not (element H or element C)) and within 2.5 of t",
                    t=pos)
                if close_ligand:
                    # close = close_ligand[np.argmin(np.linalg.norm(close_ligand.getCoords() - pos, axis=1))]
                    for close in close_ligand:
                        result.append((atom.getName(),
                                       str(atom.getResnum()) + atom.getChid(),
                                       close.getName(),
                                       str(close.getResnum()) + close.getChid()))
        # One Rosetta AtomPair constraint per metal-ligand contact.
        with open(output, "w") as f:
            for r in result:
                f.write(f"AtomPair {r[0]} {r[1]} {r[2]} {r[3]} SQUARE_WELL 2.5 -2000\n")


    if __name__ == "__main__":
        process(sys.argv[1], sys.argv[2])
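The constraint file written by Listing C.8 contains one `AtomPair` line per contact between a metal ion and a nearby ligand heteroatom. The selection and the line format can be sketched without ProDy as follows; the dictionary layout, the function names, and the coordinates are assumptions made for this example, and the inclusive distance test is a choice of the sketch rather than a statement about ProDy's `within` semantics.

```python
import math


def close_heteroatoms(metal_xyz, ligand_atoms, cutoff=2.5):
    """Ligand atoms that are neither H nor C and lie within cutoff of the metal."""
    return [
        atom for atom in ligand_atoms
        if atom["element"] not in ("H", "C")
        and math.dist(metal_xyz, atom["xyz"]) <= cutoff
    ]


def atom_pair_constraint(metal, ligand_atom, x0=2.5, depth=-2000):
    """Format one constraint line: atom name and residue+chain for both partners."""
    return (f"AtomPair {metal['name']} {metal['res']} "
            f"{ligand_atom['name']} {ligand_atom['res']} SQUARE_WELL {x0} {depth}")


# Hypothetical zinc ion and three ligand atoms (coordinates in Angstrom).
metal = {"name": "ZN", "res": "301Z", "xyz": (0.0, 0.0, 0.0)}
ligand_atoms = [
    {"name": "O1", "res": "1X", "element": "O", "xyz": (2.0, 0.0, 0.0)},  # kept
    {"name": "C2", "res": "1X", "element": "C", "xyz": (1.5, 0.0, 0.0)},  # carbon: skipped
    {"name": "N3", "res": "1X", "element": "N", "xyz": (3.0, 0.0, 0.0)},  # too far
]

lines = [atom_pair_constraint(metal, atom)
         for atom in close_heteroatoms(metal["xyz"], ligand_atoms)]
print("\n".join(lines))
```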

Listing C.9: thesis/appendix/relax.xml

    <ROSETTASCRIPTS>
        <SCOREFXNS>
        </SCOREFXNS>
        <FILTERS>
        </FILTERS>
        <TASKOPERATIONS>
            <DetectProteinLigandInterface name="ligandInterface" cut1="0.0" cut2="0.0" cut3="16.0" cut4="20.0"
                design="0" catres_interface="0" catres_only_interface="0" arg_sweep_interface="0"/>
        </TASKOPERATIONS>
        <MOVERS>
            <ConstraintSetMover name="constraint" add_constraints="true" cst_file="./constraints"/>
            <EnzRepackMinimize name="relax" scorefxn_repack="REF2015" scorefxn_minimize="REF2015"
                cst_opt="0" design="0" repack_only="1" fix_catalytic="0" minimize_rb="1" minimize_bb="1"
                minimize_sc="1" minimize_lig="1" min_in_stages="0" backrub="0" cycles="1"
                task_operations="ligandInterface"/>
        </MOVERS>
        <PROTOCOLS>
            <Add mover_name="constraint"/>
            <Add mover_name="relax"/>
        </PROTOCOLS>
    </ROSETTASCRIPTS>

C.2 Precompute Rosetta energies

Listing C.10: thesis/appendix/compute_rosetta_energy.py

    #!/usr/bin/env python3

    from __future__ import print_function

    import argparse
    import os
    import h5py
    from pyrosetta.rosetta.protocols.scoring import Interface
    from pyrosetta.rosetta import *
    from pyrosetta import *
    from pathlib import Path
    import numpy as np
    from multiprocessing import Pool, cpu_count
    from collections import defaultdict
    from pyrosetta.toolbox.atom_pair_energy import print_residue_pair_energies
    init('-in:auto_setup_metals') #-mute core.conformation.Conformation')


    def compute_atom_pair_energy(pdb_filename, ligand_params, interface_cutoff=21.0):
        if type(ligand_params) is str:
            ligand_params = [ligand_params]
        ligand_params = Vector1([str(ligand_params)])

        # Register the ligand residue type before reading the complex.
        pose = Pose()
        res_set = pose.conformation().modifiable_residue_type_set_for_conf()
        res_set.read_files_for_base_residue_types(ligand_params)

        pose.conformation().reset_residue_type_set_for_conf(res_set)
        pose_from_file(pose, str(pdb_filename))
        scorefxn = create_score_function('ref2015')
        pose_score = scorefxn(pose)

        # detect interface
        fold_tree = pose.fold_tree()
        for jump in range(1, pose.num_jump() + 1):
            name = pose.residue(fold_tree.downstream_jump_residue(jump)).name()
            if name == 'WER':
                break
        interface = Interface(jump)
        interface.distance(interface_cutoff)
        interface.calculate(pose)

        # Accumulate, per atom, the pairwise energies with all other interface atoms.
        energies = []
        en = defaultdict(lambda: np.zeros((1, 4)))
        keys = []
        for rnum1 in range(1, pose.total_residue() + 1):
            if interface.is_interface(rnum1):
                r1 = pose.residue(rnum1)
                for a1 in range(1, len(r1.atoms()) + 1):
                    seq1 = pose.pdb_info().pose2pdb(rnum1).strip().replace(' ', '-')
                    at1 = r1.atom_name(a1).strip()
                    key1 = seq1 + '-' + at1
                    for rnum2 in range(rnum1 + 1, pose.total_residue() + 1):
                        if interface.is_interface(rnum2):
                            r2 = pose.residue(rnum2)
                            for a2 in range(1, len(r2.atoms()) + 1):
                                seq2 = pose.pdb_info().pose2pdb(rnum2).strip().replace(' ', '-')
                                at2 = r2.atom_name(a2).strip()
                                key2 = seq2 + '-' + at2
                                ee = etable_atom_pair_energies(r1, a1, r2, a2, scorefxn)
                                if all(e == 0.0 for e in ee):
                                    continue
                                en[key1] += np.array(ee)
                                en[key2] += np.array(ee)
        energy_matrix = np.array([v for v in en.values()])
        return list(en.keys()), energy_matrix


    def get_radii_and_charges(pdb_filename, ligand_params):
        keys = []
        charges = []
        radii = []

        if type(ligand_params) is str:
            ligand_params = [ligand_params]
        ligand_params = Vector1([str(ligand_params)])

        pose = Pose()
        res_set = pose.conformation().modifiable_residue_type_set_for_conf()
        res_set.read_files_for_base_residue_types(ligand_params)

        pose.conformation().reset_residue_type_set_for_conf(res_set)
        pose_from_file(pose, str(pdb_filename))
        for rnum1 in range(1, pose.total_residue() + 1):
            r1 = pose.residue(rnum1)
            for a1 in range(1, len(r1.atoms()) + 1):
                seq1 = pose.pdb_info().pose2pdb(rnum1).strip().replace(' ', '-')
                at1 = r1.atom_name(a1).strip()
                key1 = seq1 + '-' + at1
                charges.append(r1.atomic_charge(a1))
                radii.append(r1.atom_type(a1).lj_radius())
                keys.append(key1)

        return keys, charges, radii


    def extract_and_save(pdb_file):
        folder = pdb_file.parent
        pdb_code = folder.stem
        ligand_params = folder / f'{pdb_code}_ligand.params'
        output_file = folder / (pdb_file.stem + '.attr')
        try:
            e_keys, e_values = compute_atom_pair_energy(pdb_file, ligand_params)
            rc_keys, charges, radii = get_radii_and_charges(pdb_file, ligand_params)
        except Exception as e:
            print("Error at", pdb_file)
            print(e)
            return
        energy_keys = np.array(e_keys)
        energy_values = np.array(e_values)
        rc_keys = np.array(rc_keys)
        radius_values = np.array(radii)
        charge_values = np.array(charges)
        np.savez_compressed(str(output_file),
                            energy_keys=e_keys,
                            energy_values=energy_values,
                            rc_keys=rc_keys,
                            radius_values=radius_values,
                            charge_values=charge_values)
        print("COMPLETED", pdb_file)


    def update_radii_charges(pdb_file):
        # Recompute radii and charges while keeping the stored energies.
        folder = pdb_file.parent
        pdb_code = folder.stem
        print(pdb_code)
        ligand_params = folder / f'{pdb_code}_ligand.params'
        output_file = folder / (pdb_file.stem + '.attr')
        rc_keys, charges, radii = get_radii_and_charges(pdb_file, ligand_params)
        rc_keys = np.array(rc_keys)
        radius_values = np.array(radii)
        charge_values = np.array(charges)
        old_data = np.load(str(output_file) + '.npz')
        e_keys = old_data['energy_keys']
        energy_values = old_data['energy_values']
        np.savez_compressed(str(output_file),
                            energy_keys=e_keys,
                            energy_values=energy_values,
                            rc_keys=rc_keys,
                            radius_values=radius_values,
                            charge_values=charge_values)


    def get_files(folder):
        return [x for x in folder.glob('*/*_complex_*.pdb')
                if not (x.parent / (x.stem + '.attr.npz')).exists()]


    def get_fix_files(folder):
        return [x for x in folder.glob('*/*_complex_*.pdb')
                if (x.parent / (x.stem + '.attr.npz')).exists()]


    if __name__ == "__main__":
        root = # Set root path for complex folders
        print(cpu_count())
        p = Pool(cpu_count() // 3)
        p.map(extract_and_save, get_files(root))
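The accumulation in `compute_atom_pair_energy` and the `.attr.npz` output can be reproduced without PyRosetta. The sketch below is a simplified illustration under stated assumptions: the atom keys and energy values are invented, the helper `accumulate_per_atom` is hypothetical, and each atom gets a flat four-component vector, whereas the listing stores arrays of shape (1, 4) per atom.

```python
import os
import tempfile
from collections import defaultdict

import numpy as np


def accumulate_per_atom(pair_energies):
    """Sum each pairwise 4-component energy onto both atoms of the pair."""
    en = defaultdict(lambda: np.zeros(4))
    for key1, key2, ee in pair_energies:
        ee = np.asarray(ee, dtype=float)
        if not ee.any():  # skip pairs whose energy terms are all zero
            continue
        en[key1] += ee
        en[key2] += ee
    keys = list(en.keys())
    return keys, np.array([en[k] for k in keys])


# Invented pair energies; keys follow the "<resnum>-<chain>-<atom>" pattern of the listing.
pairs = [
    ("10-A-CA", "1-X-O1", [-1.0, 0.5, 0.2, -0.3]),
    ("10-A-CA", "1-X-N2", [-0.5, 0.0, 0.0, 0.0]),
    ("11-A-CB", "1-X-O1", [0.0, 0.0, 0.0, 0.0]),  # all zero: ignored
]
keys, values = accumulate_per_atom(pairs)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.attr")
    # savez_compressed appends ".npz", hence the ".attr.npz" files in the listing.
    np.savez_compressed(path, energy_keys=np.array(keys), energy_values=values)
    with np.load(path + ".npz") as loaded:
        loaded_keys = list(loaded["energy_keys"])
        loaded_values = loaded["energy_values"]

print(loaded_keys)
print(loaded_values)
```

Because every pair contributes to both of its atoms, the sum over all rows counts each pair energy twice, which matters if the per-atom values are later aggregated.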

C.3 CNN Architectures

Listing C.11: thesis/appendix/original_kdeep.py

    import tensorflow as tf

    LEARNING_RATE = 0.0001


    def fire_module(net, squeeze, expand, training):
        # Squeeze with 1x1x1 filters, then expand with parallel 1x1x1 and 3x3x3 filters.
        net = tf.layers.conv3d(net, squeeze, [1, 1, 1], activation=tf.nn.relu)
        net1 = tf.layers.conv3d(net, expand, [1, 1, 1], activation=tf.nn.relu)
        net2 = tf.layers.conv3d(net, expand, [3, 3, 3], padding='same', activation=tf.nn.relu)
        return tf.concat(axis=-1, values=[net1, net2])


    def conv_net(X, reuse, training):
        with tf.variable_scope('SqueezeNet', reuse=reuse):
            print(X)
            net = tf.layers.conv3d(X, 96, 1, 2, padding='same', activation=tf.nn.relu)
            print(net)
            net = fire_module(net, 16, 64, training=training)
            net = fire_module(net, 16, 64, training=training)
            net = fire_module(net, 32, 128, training=training)
            net = tf.layers.max_pooling3d(net, 3, 2)
            print(net)
            net = fire_module(net, 32, 128, training=training)
            net = fire_module(net, 48, 192, training=training)
            net = fire_module(net, 48, 192, training=training)
            net = fire_module(net, 64, 256, training=training)
            net = tf.layers.average_pooling3d(net, 3, 2)
            net = tf.layers.flatten(net)
            net = tf.layers.dense(net, 1)
            return net


    def model_fn(features, labels, mode, rotate):
        training = mode == tf.estimator.ModeKeys.TRAIN
        predictions = conv_net(features, reuse=tf.AUTO_REUSE, training=training)
        loss = None
        train_op = None
        if training:
            loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
            optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
            train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        else:
            if rotate:
                # Average the predictions over the 24 rotations of each complex.
                predictions = tf.reduce_mean(tf.reshape(predictions, (-1, 24, 1)), axis=1)
                loss = tf.losses.mean_squared_error(labels, predictions)
            else:
                loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)

        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=predictions,
            loss=loss,
            train_op=train_op)
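The channel widths of this network follow directly from the fire modules: each one concatenates two expand branches, so it outputs twice its `expand` value, and pooling leaves the channel count unchanged. The short calculation below traces the widths and counts the parameters of one module; it is a sketch for orientation, the helper names are hypothetical, and it assumes the default bias term of `tf.layers.conv3d`.

```python
def fire_out_channels(expand):
    """A fire module concatenates a 1x1x1 and a 3x3x3 expand branch."""
    return 2 * expand


def fire_params(in_channels, squeeze, expand):
    """Weights plus biases of one fire module with 3D convolutions."""
    squeeze_conv = in_channels * squeeze + squeeze  # 1x1x1 squeeze
    expand_1 = squeeze * expand + expand            # 1x1x1 expand
    expand_3 = 27 * squeeze * expand + expand       # 3x3x3 expand
    return squeeze_conv + expand_1 + expand_3


# (squeeze, expand) pairs in the order they appear in conv_net.
FIRE_SPECS = [(16, 64), (16, 64), (32, 128), (32, 128), (48, 192), (48, 192), (64, 256)]

channels = [96]  # output channels of the first convolution
for squeeze, expand in FIRE_SPECS:
    channels.append(fire_out_channels(expand))

print(channels)
print(fire_params(channels[0], *FIRE_SPECS[0]))
```

Most of a module's parameters sit in the 3x3x3 expand branch, which is why the squeeze layer, by reducing the channels that branch sees, keeps the model small.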

Listing C.12: thesis/appendix/large_kdeep.py

    import tensorflow as tf

    LEARNING_RATE = ...  # value not preserved in this listing


    def fire_module(net, squeeze, expand, training):
        net = tf.layers.conv3d(net, squeeze, [1, 1, 1], activation=tf.nn.relu)
        net1 = tf.layers.conv3d(net, expand, [1, 1, 1], activation=tf.nn.relu)
        net2 = tf.layers.conv3d(net, expand, [3, 3, 3], padding='same', activation=tf.nn.relu)
        return tf.concat(axis=-1, values=[net1, net2])


    def conv_net(X, reuse, training):
        with tf.variable_scope('SqueezeNet', reuse=reuse):
            print(X)
            # First convolution uses a 7x7x7 kernel with stride 2.
            net = tf.layers.conv3d(X, 96, 7, 2, padding='same', activation=tf.nn.relu)
            print(net)
            net = fire_module(net, 16, 64, training=training)
            net = fire_module(net, 16, 64, training=training)
            net = fire_module(net, 32, 128, training=training)
            net = tf.layers.max_pooling3d(net, 3, 2)
            print(net)
            net = fire_module(net, 32, 128, training=training)
            net = fire_module(net, 48, 192, training=training)
            net = fire_module(net, 48, 192, training=training)
            net = fire_module(net, 64, 256, training=training)
            net = tf.layers.average_pooling3d(net, 3, 2)
            net = tf.layers.flatten(net)
            net = tf.layers.dense(net, 1)
            return net


    def model_fn(features, labels, mode, rotate):
        training = mode == tf.estimator.ModeKeys.TRAIN
        predictions = conv_net(features, reuse=tf.AUTO_REUSE, training=training)
        loss = None
        train_op = None
        if training:
            loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
            optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
            train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        else:
            if rotate:
                predictions = tf.reduce_mean(tf.reshape(predictions, (-1, 24, 1)), axis=1)
                loss = tf.losses.mean_squared_error(labels, predictions)
            else:
                loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)

        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=predictions,
            loss=loss,
            train_op=train_op)

Listing C.13: thesis/appendix/resnet_101.py

import tensorflow as tf

# [...]

def start_block(net, channels, training):
    input_net = tf.layers.conv3d(net, 4*channels, 1, 2, padding='same')
    net = conv3bn(net, channels, 1, training, stride=2)
    net = conv3bn(net, channels, 3, training)
    net = conv3bn(net, 4*channels, 1, training)
    output_net = net + input_net
    return output_net

def inner_block(net, channels, training):
    input_net = tf.layers.conv3d(net, 4*channels, 1, padding='same')
    net = conv3bn(net, channels, 1, training)
    net = conv3bn(net, channels, 3, training)
    net = conv3bn(net, 4*channels, 1, training)
    output_net = net + input_net
    return output_net


def conv3bn(net, channels, filt, training, stride=1):
    # net = tf.layers.batch_normalization(net, training=training)
    net = tf.nn.relu(net)
    net = tf.layers.conv3d(net, channels, filt, stride, padding='same')
    return net

def conv_net(X, reuse, training):
    with tf.variable_scope('ResNet', reuse=reuse):
        layers = [3, 4, 23, 3]
        k = 64
        net = tf.layers.conv3d(X, k, 7, 2, padding='same')
        # net = tf.layers.batch_normalization(net, training=training)
        net = tf.nn.relu(net)
        net = tf.layers.max_pooling3d(net, 3, 2)
        for i in range(0, layers[0]):
            net = inner_block(net, k, training)
        for i, l in enumerate(layers[1:], 1):
            net = start_block(net, k*(2**i), training)
            for j in range(0, l-1):
                net = inner_block(net, k*(2**i), training)
        net = tf.reduce_mean(net, axis=(1, 2, 3))
        net = tf.layers.flatten(net)
        net = tf.layers.dense(net, 1)
        return net

def model_fn(features, labels, mode, rotate):
    training = mode == tf.estimator.ModeKeys.TRAIN
    predictions = conv_net(features, reuse=tf.AUTO_REUSE, training=training)

    loss = None
    train_op = None
    if training:
        loss = tf.losses.mean_squared_error(labels=labels, predictions [...]
        [...] )
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(loss, global_step=tf.train. [...]
        [...] , (-1, 24, 1)), axis=1)
        loss = tf.losses.mean_squared_error(labels, predictions)
    else:
        loss = tf.losses.mean_squared_error(labels=labels,
                                            predictions=predictions)
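As a rough cross-check of the architecture in Listing C.13, the arithmetic below counts the layers implied by layers = [3, 4, 23, 3] and k = 64. It is only a sketch under stated assumptions: it counts the three conv3bn calls on the main path of each block plus the first convolution and the final dense layer, and it ignores the 1x1 shortcut convolutions. The names CONVS_PER_BLOCK, main_path_convs, depth and stage_channels are illustrative and do not appear in the thesis code.

```python
layers = [3, 4, 23, 3]   # blocks per stage, as in the listing
k = 64                   # base channel width, as in the listing
CONVS_PER_BLOCK = 3      # conv3bn calls on the main path of each block (assumed counting rule)

# Main-path convolutions over all blocks.
main_path_convs = CONVS_PER_BLOCK * sum(layers)

# First convolution + block convolutions + final dense layer.
depth = 1 + main_path_convs + 1

# Each block ends with a conv3bn of 4 * channels, where channels = k * 2**i.
stage_channels = [4 * k * 2 ** i for i in range(len(layers))]

print(depth, stage_channels)
```

Under this counting rule the block layout matches the usual 101-layer configuration, with the channel width doubling at every stage.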


C.4 Feature generation

Listing C.14: thesis/appendix/make_htmd_features.py

from pathlib import Path
import numpy as np
from collections import defaultdict
import h5py
from htmd.molecule.molecule import Molecule
from htmd.molecule.voxeldescriptors import getVoxelDescriptors
from multiprocessing import Pool, cpu_count
from htmd.builder.preparation import proteinPrepare

def save_grid(saving_path, pdb, image):
    with h5py.File(str(saving_path), "w", libver='latest') as f:
        f.create_dataset("grid", dtype='f4', data=image)

def build_images(protein_file, ligand_file, size):
    protein = Molecule(str(protein_file))
    ligand = Molecule(str(ligand_file))
    protein.filter('not (water or name CO or name NI or name CU or name NA)')
    # protein.filter('not name HG and not name CD and not name K and not name CU and not name SE and not name LI and not name NI and not name CO and not name CS and not name SR and not name MN and not name NA')
    center = np.mean(ligand.get('coords'), axis=0)
    ligand.moveBy(-center)
    protein.moveBy(-center)
    size_ang = 12
    center = np.mean(ligand.get('coords'), axis=0)
    ex_min = center - size_ang
    ex_max = center + size_ang
    x = np.linspace(ex_min[0], ex_max[0], size)
    y = np.linspace(ex_min[1], ex_max[1], size)
    z = np.linspace(ex_min[2], ex_max[2], size)
    position_matrix = np.stack(np.meshgrid(x, y, z, indexing='ij'),
                               axis=-1).reshape((-1, 3))
    prot_features = getVoxelDescriptors(
        protein, usercenters=position_matrix)[0].reshape(3*(size,) + (-1,))
    inh_features = getVoxelDescriptors(
        ligand, usercenters=position_matrix)[0].reshape(3*(size,) + (-1,))
    final_grid = np.concatenate((prot_features, inh_features), axis=-1)
    # import pdb; pdb.set_trace()
    return final_grid

def get_files(folder):
    for protein_file in folder.glob('*/*protein0*.pdbqt'):
        pdb = protein_file.parent
        try:
            ligand_file = next(pdb.glob('*ligand*.pdbqt'))
        except:
            continue
        yield (pdb.stem, protein_file, ligand_file)

def process_file(params):
    pdb, prot, lig = params
    saving_path = saving_folder / (prot.name.replace('.pdbqt', '.hdf5')
                                   .replace('protein', 'complex'))
    if saving_path.exists():
        return True
    print(pdb)
    try:
        image = build_images(prot, lig, size)
    except Exception as e:
        print(e)
        return False
    save_grid(saving_path, pdb, image)
    return True

if __name__ == "__main__":
    size = 25
    root =  # Set root path for complex folders
    saving_folder =  # Saving HDF5 folder
    saving_folder.mkdir(parents=True, exist_ok=True)
    p = Pool(cpu_count())
    p.map(process_file, get_files(root))
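The voxel grid built in Listing C.14 can be illustrated without HTMD or NumPy. The sketch below assumes the ligand centroid sits at the origin (as it does after the moveBy call) and uses a hypothetical pure-Python linspace helper; it only shows the grid geometry, not the descriptor computation.

```python
from itertools import product

def linspace(start, stop, num):
    """Evenly spaced values including both endpoints (hypothetical stand-in for numpy.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

size = 25                  # voxels per axis, as in the listing
size_ang = 12              # half-width of the box in angstrom, as in the listing
center = (0.0, 0.0, 0.0)   # assumed ligand centroid after re-centring

# One coordinate axis per dimension, spanning center +/- size_ang.
axes = [linspace(c - size_ang, c + size_ang, size) for c in center]
spacing = axes[0][1] - axes[0][0]

# Same ordering as meshgrid(indexing='ij') followed by reshape((-1, 3)):
# the x index varies slowest and the z index fastest.
positions = list(product(*axes))

print(spacing, len(positions))
```

With these settings the box is 24 angstrom wide and sampled at 1 angstrom, giving 25 x 25 x 25 voxel centres.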

Listing C.15: thesis/appendix/make_rosetta_features.py

from operator import itemgetter
from prody import parsePDB, calcCenter, parseDCD, moveAtoms
import itertools
from apbs import APBS
from collections import defaultdict
from multiprocessing import Pool, cpu_count
from pathlib import Path
import numpy as np
from collections import defaultdict
import h5py
from scipy.interpolate import Rbf
from scipy.spatial.distance import cdist

def save_grid(saving_path, image):
    with h5py.File(str(saving_path), "w", libver='latest') as f:
        f.create_dataset("grid", dtype='f4', data=image)

def get_files(folder):
    print(folder)
    return folder.glob('*/*complex*.pdb')

def import_data(pdb_path):
    data_path = str(pdb_path).replace('.pdb', '.attr.npz')
    data = np.load(data_path)
    radii = dict(zip(data['rc_keys'], data['radius_values']))
    charges = dict(zip(data['rc_keys'], data['charge_values']))
    energies = data['energy_values'].squeeze()
    energy_keys = data['energy_keys']
    return radii, charges, energy_keys, energies

from scipy.spatial import KDTree
def apply_filter(filter_type, points, values, targets, radii):
    c = 0
    mask = np.linalg.norm(points, axis=-1) <= 12.5*np.sqrt(3)
    points = points[mask]
    values = values[mask, :]
    radii = radii[mask]
    dists = cdist(points, targets)
    aux = np.where(dists < 5, filter_type(dists, radii), 0)
    del dists
    # import pdb; pdb.set_trace()
    result = np.array([values[np.argmax(aux, axis=0), i] * np.max(aux, axis=0)
                       for i in range(values.shape[-1])])
    result = np.swapaxes(result, 0, 1)
    return result

def interpolate(int_type, points, values, targets):
    mask = np.linalg.norm(points, axis=-1) <= 12.5*np.sqrt(3)
    points = points[mask]
    values = values[mask, :]
    # import pdb; pdb.set_trace()
    points_x, points_y, points_z = [c.flatten() for c in np.split(points, 3, axis=1)]
    targets_x, targets_y, targets_z = [c.flatten() for c in np.split(targets, 3, axis=1)]
    res = np.stack([Rbf(points_x, points_y, points_z, values[..., i],
                        function=int_type)(targets_x, targets_y, targets_z)
                    for i in range(values.shape[-1])], axis=-1)
    return res

GRIDS = ['fa_atr', 'fa_rep', 'fa_sol', 'fa_elec']

def grid_around(center, size, spacing=1.0):
    size_ang = ((size - 1) / 2.) * spacing
    ex_min = center - size_ang
    ex_max = center + size_ang

    x = np.linspace(ex_min[0], ex_max[0], size)
    y = np.linspace(ex_min[1], ex_max[1], size)
    z = np.linspace(ex_min[2], ex_max[2], size)
    return np.stack(np.meshgrid(x, y, z, indexing='ij'), axis=-1)

def get_keys(pdb):
    resnums = pdb.getResnums()
    chids = pdb.getChids()
    names = pdb.getNames()
    keys = np.char.add(np.char.mod('%s-', np.char.replace(np.char.add(
        np.char.mod('%s', resnums), chids), ' ', '-')), names)
    return keys

def build_images(pdb_path, size, interpolation_mode):
    complex = parsePDB(str(pdb_path))
    protein = complex.select("not (resname WER or water)")
    ligand = complex.select("resname WER")
    center = calcCenter(ligand.getCoords())
    moveAtoms(complex, by=-center)
    center = calcCenter(complex.select("resname WER").getCoords())

    size_ang = ((size - 1) / 2.) * (24/(size-1))
    ex_min = center - size_ang
    ex_max = center + size_ang

    position_matrix = grid_around(center, size, spacing=24/(size-1))
    radii, charges, energy_keys, energies = import_data(pdb_path)
    coordinates = complex.getCoords()
    try:
        keys = get_keys(complex)
    except:
        print('FAILED AT KEYS', pdb_path)
        raise
    keys_p = get_keys(protein)
    keys_l = get_keys(ligand)

    key_map = dict(zip(keys, range(len(keys))))

    energy_ids = [i for i, x in enumerate(energy_keys) if x in key_map]
    energy_ids_p = [i for i, x in enumerate(energy_keys)
                    if x in key_map and x in keys_p]
    energy_ids_l = [i for i, x in enumerate(energy_keys)
                    if x in key_map and x in keys_l]


    energy_coordinates = np.array([coordinates[key_map[x]] for x in energy_keys
                                   if x in key_map])
    energy_coordinates_p = np.array([coordinates[key_map[x]] for x in energy_keys
                                     if x in key_map and x in keys_p])
    energy_coordinates_l = np.array([coordinates[key_map[x]] for x in energy_keys
                                     if x in key_map and x in keys_l])

    radii_c = np.array([radii[x] for x in energy_keys if x in key_map])
    radii_p = np.array([radii[x] for x in energy_keys
                        if x in key_map and x in keys_p])
    radii_l = np.array([radii[x] for x in energy_keys
                        if x in key_map and x in keys_l])

    image = {}
    image_p = {}
    image_l = {}
    for field in GRIDS:
        image[field] = np.zeros(3 * (size,))
        image_p[field] = np.zeros(3 * (size,))
        image_l[field] = np.zeros(3 * (size,))
    #image[k] = apply_filter(filter_mode, energy_coordinates, energies[energy_ids, i], position_matrix.reshape((-1,3)), radii_c).reshape(3*(size,) + (-1,))
    rosetta_p = apply_filter(filter_mode, energy_coordinates_p,
                             energies[energy_ids_p, :],
                             position_matrix.reshape((-1, 3)),
                             radii_p).reshape(3*(size,) + (-1,))
    rosetta_l = apply_filter(filter_mode, energy_coordinates_l,
                             energies[energy_ids_l, :],
                             position_matrix.reshape((-1, 3)),
                             radii_l).reshape(3*(size,) + (-1,))
    # rosetta_p = interpolate(interpolation_mode, energy_coordinates_p, energies[energy_ids_p, :], position_matrix.reshape((-1,3))).reshape(3*(size,) + (-1,))
    # rosetta_l = interpolate(interpolation_mode, energy_coordinates_l, energies[energy_ids_l, :], position_matrix.reshape((-1,3))).reshape(3*(size,) + (-1,))
    channel_max = np.array([-1.97975219e-02, 4.11150004e+02, 4.01129569e+00,
                            1.28136470e+00, -0.04185846, 0.59311066,
                            2.67294434, 0.40072521])
    channel_min = np.array([-2.27475644, 0., -0.37756078, -3.79981632,
                            -1.7244696, 0., -0.44943017, -2.00621753])
    rosetta_atr_p = np.clip(rosetta_p[..., 0], channel_min[0], 0)/channel_min[0]
    rosetta_rep_p = np.clip(rosetta_p[..., 1], 0, channel_max[1])/channel_max[1]
    rosetta_sol_p_pos = np.clip(rosetta_p[..., 2], 0, channel_max[2])/channel_max[2]
    rosetta_elec_p_pos = np.clip(rosetta_p[..., 3], 0, channel_max[3])/channel_max[3]
    rosetta_sol_p_neg = np.clip(rosetta_p[..., 2], channel_min[2], 0)/channel_min[2]
    rosetta_elec_p_neg = np.clip(rosetta_p[..., 3], channel_min[3], 0)/channel_min[3]
    rosetta_atr_l = np.clip(rosetta_l[..., 0], channel_min[4], 0)/channel_min[4]
    rosetta_rel_l = np.clip(rosetta_l[..., 1], 0, channel_max[5])/channel_max[5]
    rosetta_sol_l_pos = np.clip(rosetta_l[..., 2], 0, channel_max[6])/channel_max[6]
    rosetta_elec_l_pos = np.clip(rosetta_l[..., 3], 0, channel_max[7])/channel_max[7]
    rosetta_sol_l_neg = np.clip(rosetta_l[..., 2], channel_min[6], 0)/channel_min[6]
    rosetta_elec_l_neg = np.clip(rosetta_l[..., 3], channel_min[7], 0)/channel_min[7]
    rosetta = np.stack((rosetta_atr_p, rosetta_rep_p, rosetta_sol_p_pos,
                        rosetta_elec_p_pos, rosetta_sol_p_neg,
                        rosetta_elec_p_neg, rosetta_atr_l, rosetta_rel_l,
                        rosetta_sol_l_pos, rosetta_elec_l_pos,
                        rosetta_sol_l_neg, rosetta_elec_l_neg), axis=-1)
    return rosetta

def process_file(file):
    saving_path = saving_folder / (file.name.replace('.pdb', '.hdf5'))
    print(saving_path)
    if saving_path.exists():
        return True
    try:
        image = build_images(file, size, interpolation_mode)
    except Exception as e:
        print(file)
        print(e)
        return False
    save_grid(saving_path, image)
    return True

def exp_12(r, rvdw):
    rvdw = rvdw.reshape((-1,))
    rr = rvdw[:, None]/r
    ret = np.where(r == 0, 1, 1 - np.exp(-(rr)**12))
    return ret

def gaussian(r, rvdw):
    rvdw = rvdw.reshape((-1,))
    rr = r/rvdw[:, None]
    ret = np.exp(-(rr)**2)
    return ret

if __name__ == "__main__":
    size = 25
    # interpolation_mode = "thin_plate"
    filter_mode = exp_12
    root =  # Set root path for complex folders
    saving_folder =  # Path for HDF5 storage
    saving_folder.mkdir(parents=True, exist_ok=True)
    p = Pool(cpu_count())
    p.map(process_file, get_files(root))
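The two distance filters defined at the end of Listing C.15 are easier to compare on scalar inputs. The following is a minimal scalar sketch of the same formulas, written with the standard math module instead of NumPy; the radius of 1.7 angstrom and the sample distances are illustrative assumptions, not values taken from the thesis.

```python
import math

def exp_12(r, rvdw):
    """Scalar form of 1 - exp(-(rvdw / r)**12), defined as 1 at r == 0."""
    if r == 0:
        return 1.0
    return 1.0 - math.exp(-((rvdw / r) ** 12))

def gaussian(r, rvdw):
    """Scalar form of exp(-(r / rvdw)**2)."""
    return math.exp(-((r / rvdw) ** 2))

rvdw = 1.7  # illustrative van der Waals radius in angstrom (assumed value)
for r in (0.5, 1.7, 2.5, 5.0):
    print(r, exp_12(r, rvdw), gaussian(r, rvdw))
```

Both filters equal 1 at the atom centre and decay with distance, but exp_12 stays close to 1 inside the van der Waals radius and drops steeply just outside it, whereas the Gaussian falls off smoothly from the centre.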

Listing C.16: thesis/appendix/make_electroneg_features.py

from operator import itemgetter
from prody import parsePDB, calcCenter, parseDCD, moveAtoms
import itertools
from apbs import APBS
from collections import defaultdict
from multiprocessing import Pool
from pathlib import Path
import numpy as np
from collections import defaultdict
import h5py
from scipy.interpolate import Rbf
from scipy.spatial.distance import cdist
from mendeleev import element

# [...]

def get_files(folder):
    return folder.glob('*/*complex*.pdb')

def import_data(pdb_path):
    data_path = str(pdb_path).replace('.pdb', '.attr.npz')
    data = np.load(data_path)
    radii = dict(zip(data['rc_keys'], data['radius_values']))
    charges = dict(zip(data['rc_keys'], data['charge_values']))
    energies = data['energy_values'].squeeze()
    energy_keys = data['energy_keys']
    return radii, charges, energy_keys, energies

from scipy.spatial import KDTree
def apply_filter(filter_type, points, values, targets, radii):
    c = 0
    mask = np.linalg.norm(points, axis=-1) <= 12.5*np.sqrt(3)
    points = points[mask]
    values = values[mask, :]
    radii = radii[mask]
    dists = cdist(points, targets)
    aux = np.where(dists < 5, filter_type(dists, radii), 0)
    del dists
    # import pdb; pdb.set_trace()
    result = values[np.argmax(aux, axis=0), 0] * np.max(aux, axis=0)
    return result

def interpolate(int_type, points, values, targets):
    mask = np.linalg.norm(points, axis=-1) <= 12.5*np.sqrt(3)
    points = points[mask]
    values = values[mask, :]
    # import pdb; pdb.set_trace()
    points_x, points_y, points_z = [c.flatten() for c in np.split(points, 3, axis=1)]
    targets_x, targets_y, targets_z = [c.flatten() for c in np.split(targets, 3, axis=1)]
    res = Rbf(points_x, points_y, points_z, values,
              function=int_type)(targets_x, targets_y, targets_z)
    return res

def grid_around(center, size, spacing=1.0):
    size_ang = ((size - 1) / 2.) * spacing
    ex_min = center - size_ang
    ex_max = center + size_ang

    x = np.linspace(ex_min[0], ex_max[0], size)
    y = np.linspace(ex_min[1], ex_max[1], size)
    z = np.linspace(ex_min[2], ex_max[2], size)
    return np.stack(np.meshgrid(x, y, z, indexing='ij'), axis=-1)

def get_keys(pdb):
    resnums = pdb.getResnums()
    chids = pdb.getChids()
    names = pdb.getNames()
    keys = np.char.add(np.char.mod('%s-', np.char.replace(np.char.add(
        np.char.mod('%s', resnums), chids), ' ', '-')), names)
    return keys


def build_images(pdb_path, size, interpolation_mode):
    complex = parsePDB(str(pdb_path))
    protein = complex.select("not (resname WER or water)")
    ligand = complex.select("resname WER")
    center = calcCenter(ligand.getCoords())
    moveAtoms(complex, by=-center)
    center = calcCenter(complex.select("resname WER").getCoords())

    size_ang = ((size - 1) / 2.) * (24/(size-1))
    ex_min = center - size_ang
    ex_max = center + size_ang

    position_matrix = grid_around(center, size, spacing=24/(size-1))
    radii, charges, energy_keys, energies = import_data(pdb_path)
    coordinates = complex.getCoords()
    keys = get_keys(complex)
    keys_p = get_keys(protein)
    keys_l = get_keys(ligand)

    existing_els = set(complex.getElements())
    el_dict = {name.capitalize(): element(name.capitalize()).en_pauling
               for name in existing_els}
    print(el_dict)
    names_p = protein.getElements()
    names_l = ligand.getElements()
    elec_p = [el_dict[name.capitalize()] for name in names_p]
    elec_l = [el_dict[name.capitalize()] for name in names_l]
    elec_p = [(elec, coord, radii[key]) for elec, coord, key in zip(
        elec_p, protein.getCoords(), keys_p) if key in energy_keys]
    elec_l = [(elec, coord, radii[key]) for elec, coord, key in zip(
        elec_l, ligand.getCoords(), keys_l) if key in energy_keys]
    key_map = dict(zip(keys, range(len(keys))))

    elec_p, coord_p, radii_p = zip(*elec_p)
    elec_l, coord_l, radii_l = zip(*elec_l)
    elec_p = np.array(elec_p).reshape((-1, 1))
    coord_p = np.array(coord_p).reshape((-1, 3))
    radii_p = np.array(radii_p).reshape((-1, 1))
    elec_l = np.array(elec_l).reshape((-1, 1))
    coord_l = np.array(coord_l).reshape((-1, 3))
    radii_l = np.array(radii_l).reshape((-1, 1))

    #image[k] = apply_filter(filter_mode, energy_coordinates, energies[energy_ids, i], position_matrix.reshape((-1,3)), radii_c).reshape(3*(size,) + (-1,))
    map_p = apply_filter(filter_mode, coord_p, elec_p,
                         position_matrix.reshape((-1, 3)),
                         radii_p).reshape(3*(size,) + (-1,))
    map_l = apply_filter(filter_mode, coord_l, elec_l,
                         position_matrix.reshape((-1, 3)),
                         radii_l).reshape(3*(size,) + (-1,))
    #map_p = interpolate(interpolation_mode, coord_p, elec_p, position_matrix.reshape((-1,3))).reshape(3*(size,) + (-1,))
    #map_l = interpolate(interpolation_mode, coord_l, elec_l, position_matrix.reshape((-1,3))).reshape(3*(size,) + (-1,))
    map_p = map_p / 3.44
    map_l = map_l / 3.98
    return np.concatenate((map_p, map_l), axis=-1)

def process_file(file):
    saving_path = saving_folder / (file.name.replace('.pdb', '.hdf5'))
    if saving_path.exists():
        return True
    try:
        image = build_images(file, size, interpolation_mode)
    except Exception as e:
        print(file)
        print(e)
        return False
    save_grid(saving_path, image)
    return True

def exp_12(r, rvdw):
    rvdw = rvdw.reshape((-1,))
    rr = rvdw[:, None]/r
    ret = np.where(r == 0, 1, 1 - np.exp(-(rr)**12))
    return ret

def gaussian(r, rvdw):
    rvdw = rvdw.reshape((-1,))
    rr = r/rvdw[:, None]
    ret = np.exp(-(rr)**2)
    return ret

if __name__ == "__main__":
    size = 25
    # interpolation_mode = "thin_plate"
    filter_mode = exp_12
    root =  # Set root path for complex folders
    saving_folder =  # Saving path for HDF5
    saving_folder.mkdir(parents=True, exist_ok=True)
    p = Pool(48)
    p.map(process_file, get_files(root))
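The divisors 3.44 and 3.98 used in Listing C.16 coincide with the Pauling electronegativities of oxygen and fluorine. The sketch below assumes, without confirmation from the listing, that they were chosen as the largest values expected in the protein and the ligand respectively, so that both maps fall in the range 0 to 1. The element lists and the helper name scaled are illustrative only.

```python
# Pauling electronegativities of common elements (standard tabulated values).
EN_PAULING = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98, "S": 2.58}

PROTEIN_SCALE = 3.44  # divisor applied to the protein map in the listing
LIGAND_SCALE = 3.98   # divisor applied to the ligand map in the listing

def scaled(elements, scale):
    """Return each element's electronegativity divided by the given scale."""
    return {el: EN_PAULING[el] / scale for el in elements}

# Assumed element sets: no fluorine in the protein, fluorine allowed in ligands.
protein_scaled = scaled(["H", "C", "N", "O", "S"], PROTEIN_SCALE)
ligand_scaled = scaled(["H", "C", "N", "O", "F", "S"], LIGAND_SCALE)

print(max(protein_scaled.values()), max(ligand_scaled.values()))
```

Because the filter weight applied in apply_filter never exceeds 1, the scaled voxel values stay within 0 and 1 as long as no element more electronegative than the assumed maximum is present.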

Listing C.17: thesis/appendix/make_apbs_features.py

from operator import itemgetter
from prody import parsePDB, calcCenter, parseDCD, moveAtoms
import itertools
from apbs import APBS
from collections import defaultdict
from multiprocessing import Pool
from pathlib import Path
import numpy as np
from collections import defaultdict
import h5py

# [...]

def get_files(folder):
    return folder.glob('*/*complex*.pdb')

def import_data(pdb_path):
    data_path = str(pdb_path).replace('.pdb', '.attr.npz')
    data = np.load(data_path)
    radii = dict(zip(data['rc_keys'], data['radius_values']))
    charges = dict(zip(data['rc_keys'], data['charge_values']))
    return radii, charges

def build_images(pdb_path, size):
    complex = parsePDB(str(pdb_path))
    ligand = complex.select("resname WER")
    center = calcCenter(ligand.getCoords())
    moveAtoms(complex, by=-center)
    radii, charges = import_data(pdb_path)
    res = complex.getResindices() + 1
    resnums = complex.getResnums()
    chids = complex.getChids()
    names = complex.getNames()
    atoms = np.char.add(np.char.mod('%s-', np.char.replace(np.char.add(
        np.char.mod('%s', resnums), chids), ' ', '-')), names)
    try:
        atom_charges = np.array(itemgetter(*atoms)(defaultdict(float, charges)))
        atom_radii = np.array(itemgetter(*atoms)(defaultdict(float, radii)))
    except KeyError as e:
        print('Error in protein: ' + str(pdb_path))
        raise e
    complex.setCharges(atom_charges)
    complex.setRadii(atom_radii)
    protein_potential, ligand_potential, complex_potential = \
        compute_electro_potential(pdb_path, complex, size)
    grid = np.stack((protein_potential, ligand_potential,
                     complex_potential), axis=-1)
    return grid

def compute_electro_potential(pdb_path, complex, size):
    path = pdb_path.parent
    center = np.array2string(calcCenter(complex.select('resname WER').
                                        getCoords()))[1:-1]
    grid_dim = ' '.join(3 * [str(size)])
    grid_space = ' '.join(3 * [str(1)])
    cglen = ' '.join(3 * [str(size + 10)])
    fglen = ' '.join(3 * [str(size - 1)])

    protein_potential = APBS.run(path, pdb_path.stem+'protein', complex,
                                 'not resname WER',
                                 grid_dim, grid_space, center, cglen,
                                 fglen)
    ligand_potential = APBS.run(path, pdb_path.stem+'ligand',
                                complex,
                                'resname WER',
                                grid_dim, grid_space, center, cglen,
                                fglen)
    complex_potential = APBS.run(path, pdb_path.stem+'complex',
                                 complex,
                                 'all',
                                 grid_dim, grid_space, center, cglen,
                                 fglen)

    return protein_potential, ligand_potential, complex_potential

def process_file(file):
    saving_path = saving_folder / (file.name.replace('.pdb', '.hdf5'))
    print(file)
    if saving_path.exists():
        return True
    try:
        image = build_images(file, size)
    except Exception as e:
        print(e)
        return False
    save_grid(saving_path, image)
    return True

if __name__ == "__main__":
    size = 25
    root =  # Set root path for complex folders
    saving_folder =  # Saving path for HDF5
    saving_folder.mkdir(parents=True, exist_ok=True)
    p = Pool(24)
    files = filter(lambda x: not (saving_folder/x.name.replace('.pdb', '.hdf5')).exists(),
                   get_files(root))
    p.map(process_file, files)

Listing C.18: thesis/appendix/apbs.in

# READ IN MOLECULES
read
    mol pqr XXX.pqr
end


elec                      # Electrostatics calculation on the solvated state
    mg-auto               # Specify the mode for APBS to run
    dime GRID_DIM         # The grid dimensions
    grid GRID_SPACE       # Grid spacing
    gcent INH_CENTER      # Center the grid
    cglen CG_LEN
    cgcent INH_CENTER
    fglen FG_LEN
    fgcent INH_CENTER
    mol 1                 # Perform the calculation on molecule 1
    lpbe                  # Solve the linearized Poisson-Boltzmann
                          # equation
    bcfl mdh              # Use all multipole moments when calculating the
                          # potential
    ion 1 0.150 2.0
    ion -1 0.150 2.0
    pdie 1.0              # Solute dielectric
    sdie 78.54            # Solvent dielectric
    chgm spl2             # Spline-based discretization of the delta
                          # functions
    srfm smol             # Molecular surface definition
    srad 1.4              # Solvent probe radius (for molecular surface)
    swin 0.3              # Solvent surface spline window (not used here)
    sdens 10.0            # Sphere density of accessibility object
    temp 298.15           # Temperature
    calcenergy no         # Calculate energies
    calcforce no          # Do not calculate forces
    write pot dx XXX_potential  # Write out the potential
end
quit
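The input file above is a template whose upper-case tokens are filled in per complex before APBS is run. The sketch below shows that substitution on a shortened copy of the template. The underscore spellings of the tokens, the helper name fill_template, the file stem complex_0 and the centre coordinates are all assumptions made for illustration.

```python
# Shortened copy of the template; token spellings are assumed.
TEMPLATE = """read
    mol pqr XXX.pqr
end
elec
    mg-auto
    dime GRID_DIM
    cglen CG_LEN
    cgcent INH_CENTER
    fglen FG_LEN
    fgcent INH_CENTER
    write pot dx XXX_potential
end
quit
"""

def fill_template(template, values):
    """Replace each placeholder token with its value, in the given order."""
    for token, value in values.items():
        template = template.replace(token, value)
    return template

size = 25  # grid points per axis, as in the feature scripts
values = {
    "XXX": "complex_0",                        # hypothetical file stem
    "GRID_DIM": " ".join(3 * [str(size)]),
    "INH_CENTER": "0.0 0.0 0.0",               # hypothetical ligand centre
    "CG_LEN": " ".join(3 * [str(size + 10)]),
    "FG_LEN": " ".join(3 * [str(size - 1)]),
}
filled = fill_template(TEMPLATE, values)
print(filled)
```

Plain string replacement is enough here because no token is a substring of another token or of any substituted value.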

Listing C.19: thesis/appendix/apbs.py

from pathlib import Path
from subprocess import call
import numpy as np
from prody import writePQR
from utils import is_float


class APBS:
    APBS_BIN_PATH = 'apbs'
    TEMPLATE_FILE = 'apbs.in'

    @staticmethod
    def run(
            output_path,
            name,
            pose,
            selection,
            grid_dim,
            grid_space,
            center,
            cglen,
            fglen):
        apbs_bin_path = APBS.APBS_BIN_PATH
        apbs_template_file = Path(APBS.TEMPLATE_FILE)
        apbs_input_file = output_path / f'apbs_{name}.in'
        apbs_output_file = output_path / f'{name}_potential.dx'
        if apbs_input_file.exists():
            apbs_input_file.unlink()
        if apbs_output_file.exists():
            apbs_output_file.unlink()
        writePQR(f'{output_path/name}.pqr', pose.select(selection))
        with apbs_template_file.open('r') as f:
            file_data = f.read()
        file_data = APBS.replace_apbs(
            file_data, str(output_path/name), grid_dim, grid_space,
            center, cglen, fglen)
        with apbs_input_file.open('w') as f:
            f.write(file_data)
        call([apbs_bin_path,
              f'{apbs_input_file.absolute()}'],
             cwd=str(output_path))
        apbs_input_file.unlink()
        o, d, potential = APBS.import_dx(apbs_output_file)
        apbs_output_file.unlink()
        return potential

45 @staticmethod46 def r e p l a c e a p b s (47 f i l e d a t a ,48 xxx ,49 grid dim ,50 grid space ,51 center ,52 cglen ,53 f g l e n ) :54 f i l e d a t a = f i l e d a t a \55 . r e p l a c e ( ’XXX ’ , xxx ) \56 . r e p l a c e ( ’GRID DIM ’ , grid dim ) \57 . r e p l a c e ( ’GRID SPACE ’ , gr id space ) \58 . r e p l a c e ( ’INH CENTER ’ , c e n t e r ) \59 . r e p l a c e ( ’CG LEN ’ , cglen ) \60 . r e p l a c e ( ’FG LEN ’ , f g l e n )61 re turn f i l e d a t a62

63 @staticmethod64 def import dx ( f i lename ) :65 o r i g i n = d e l t a = data = dims = None66 counter = 067 with open ( fi lename , ’ r ’ ) as d x f i l e :68 f o r row in d x f i l e :69 row = row . s t r i p ( ) . s p l i t ( )70 i f not row :71 continue72 i f row [ 0 ] == ’ # ’ :

87

C. Pipeline scripts

73 continue74 e l i f row [ 0 ] == ’ o r i g i n ’ :75 o r i g i n = np . array ( row [ 1 : ] , dtype= f l o a t )76 e l i f row [ 0 ] == ’ d e l t a ’ :77 d e l t a = np . array ( row [ 2 : ] , dtype= f l o a t )78 e l i f row [ 0 ] == ’ o b j e c t ’ :79 i f row [ 1 ] == ’ 1 ’ :80 dims = np . array ( row [ −3 : ] , dtype= i n t )81 data = np . empty ( np . prod ( dims ) )82 e l i f i s f l o a t ( row [ 0 ] ) :83 data [3 ∗ counter : min (3 ∗ ( counter + 1) , len ( data

) )84 ] = np . array ( row , dtype= f l o a t )85 counter += 186 data = data . reshape ( dims )87 re turn or ig in , del ta , data88

89 @staticmethod90 def export dx ( fi lename , density , or ig in , d e l t a ) :91 nx , ny , nz = densi ty . shape92 with open ( fi lename , ’w’ ) as d x f i l e :93 d x f i l e . wri te (94 f ’ o b j e c t 1 c l a s s g r i d p o s i t i o n s counts {nx} {ny} {nz

}\n ’ )95 d x f i l e . wri te ( f ’ o r i g i n { o r i g i n [ 0 ]} { o r i g i n [ 1 ]} { o r i g i n

[ 2 ]}\n ’ )96 d x f i l e . wri te ( f ’ d e l t a {d e l t a } 0 . 0 0 . 0\n ’ )97 d x f i l e . wri te ( f ’ d e l t a 0 . 0 {d e l t a } 0 . 0\n ’ )98 d x f i l e . wri te ( f ’ d e l t a 0 . 0 0 . 0 {d e l t a }\n ’ )99 d x f i l e . wri te (

100 f ’ o b j e c t 2 c l a s s gr idconnect ions counts {nx} , {ny} ,{nz}\n ’ )

101 d x f i l e . wri te (102 f ’ o b j e c t 3 c l a s s array type double rank 0 items {nx

∗ ny ∗ nz} data fo l lows \n ’ )103 i = 1104 f o r d in densi ty . f l a t t e n ( order= ’C ’ ) :105 i f i % 3 :106 d x f i l e . wri te ( ’ {} ’ . format ( d ) )107 e l s e :108 d x f i l e . wri te ( ’ {}\n ’ . format ( d ) )109 i += 1110

111 d x f i l e . wri te ( ’\n ’ )
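The row-by-row parsing done in import_dx can be illustrated without NumPy or an APBS run. The following is a minimal sketch, not the code used in the pipeline: the function name parse_dx and the tiny example grid are assumptions made for illustration, and, unlike the listing, the sketch keeps all three delta rows and returns the values as a flat list instead of a reshaped array.

```python
def parse_dx(text):
    """Parse OpenDX-style text into (origin, deltas, dims, values).

    Hypothetical helper for illustration; it mirrors the keyword
    dispatch of import_dx but uses only the standard library.
    """
    origin, deltas, dims, values = None, [], None, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith('#'):
            continue  # skip blank lines and comments
        if tokens[0] == 'origin':
            origin = [float(t) for t in tokens[1:]]
        elif tokens[0] == 'delta':
            deltas.append([float(t) for t in tokens[1:]])
        elif tokens[0] == 'object':
            if tokens[1] == '1':
                # "object 1 class gridpositions counts nx ny nz"
                dims = [int(t) for t in tokens[-3:]]
        else:
            try:
                row = [float(t) for t in tokens]
            except ValueError:
                continue  # non-numeric trailer lines are ignored
            values.extend(row)  # data rows hold up to three values
    if dims is None or len(values) != dims[0] * dims[1] * dims[2]:
        raise ValueError('grid size does not match the number of values')
    return origin, deltas, dims, values


# Made-up 2 x 1 x 2 grid, written in the layout produced by export_dx.
example = """# comment line
object 1 class gridpositions counts 2 1 2
origin 0.0 0.0 0.0
delta 1.0 0.0 0.0
delta 0.0 1.0 0.0
delta 0.0 0.0 1.0
object 2 class gridconnections counts 2 1 2
object 3 class array type double rank 0 items 4 data follows
0.5 1.5 2.5
3.5
"""
origin, deltas, dims, values = parse_dx(example)
```

The size check at the end is an addition of the sketch; it catches truncated files before the values are reshaped into a grid.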

Listing C.20: thesis/appendix/make_supermap.py

import h5py
import numpy as np
import os
from pathlib import Path
from multiprocessing import Pool

suffix = ""

def get_files():
    rosetta = [x for x in os.listdir(rosetta_path) if '.hdf5' in x]
    htmd = [x for x in os.listdir(htmd_path) if '.hdf5' in x]
    apbs = [x for x in os.listdir(apbs_path) if '.hdf5' in x]
    elec = [x for x in os.listdir(electroneg_path) if '.hdf5' in x]
    files = set.intersection(set(rosetta), set(htmd), set(elec))
    return files

def combine_maps(file):
    rosetta = os.path.join(rosetta_path, file)
    htmd = os.path.join(htmd_path, file)
    apbs = os.path.join(apbs_path, file)
    elec = os.path.join(electroneg_path, file)
    output = os.path.join(output_path, file)
    if Path(output).exists():
        return
    try:
        with h5py.File(rosetta, 'r') as f:
            rosetta_grid = np.array(f['grid'])
    except:
        print("Error rosetta", file)
        return
    try:
        with h5py.File(htmd, 'r') as f:
            htmd_grid = np.array(f['grid'])
    except:
        print("Error htmd", file)
        return
    try:
        with h5py.File(elec, 'r') as f:
            elec_grid = np.array(f['grid'])
    except:
        print("Error elec", file)
        return
    try:
        with h5py.File(apbs, 'r') as f:
            apbs_grid = np.array(f['grid'])
    except:
        print("Error apbs", file)
        return
    grid = np.concatenate((htmd_grid, elec_grid, rosetta_grid), axis=-1)
    print(grid.shape)
    with h5py.File(output, 'w', libver='latest') as f:
        f.create_dataset("grid", dtype='f4', data=grid)

if __name__ == "__main__":
    output_path = # Set HDF5 output folder
    rosetta_path = # Set HDF5 input folder
    htmd_path = # Set HDF5 input folder
    electroneg_path = # Set HDF5 input folder
    apbs_path = # Set HDF5 input folder
    Path(output_path).mkdir(parents=True, exist_ok=True)
    p = Pool(48)
    p.map(combine_maps, get_files())
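The file selection in get_files amounts to a set intersection over directory listings: only complexes for which every feature grid exists are combined. A minimal sketch of that idea follows, assuming in-memory lists instead of os.listdir calls; the helper name common_files and the file names are hypothetical, and the sketch matches on the file extension with endswith rather than the substring test used in the listing.

```python
def common_files(*listings, extension='.hdf5'):
    """Return the sorted names that appear in every listing.

    Hypothetical helper: each listing stands in for the result of
    os.listdir on one feature folder.
    """
    filtered = [
        {name for name in names if name.endswith(extension)}
        for names in listings
    ]
    if not filtered:
        return []
    return sorted(set.intersection(*filtered))


# Made-up folder contents for three feature types.
rosetta = ['1abc.hdf5', '2xyz.hdf5', 'notes.txt']
htmd = ['1abc.hdf5', '2xyz.hdf5', '3def.hdf5']
elec = ['2xyz.hdf5', '1abc.hdf5']
shared = common_files(rosetta, htmd, elec)
```

Passing every feature folder to the same call keeps the selection and the later concatenation in agreement about which files are complete.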

Listing C.21: thesis/appendix/make_tfrecords.py

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import tensorflow as tf
import h5py
import numpy as np
from pathlib import Path
import pandas as pd
import shutil


def get_pdb_core_set_list():
    pdb_core = # Path to core pdb list file
    with pdb_core.open('r') as core_list_file:
        pdb_lines = [x.replace('\n', '') for x in core_list_file.readlines()]
    return set(pdb_lines)

def get_pdb_labels(source):
    if source == 'PDBBind':
        csv_file = 'INDEX_refined_data.2018'
        csv = pd.read_csv(csv_file,
                          comment='#',
                          delim_whitespace=True,
                          header=None,
                          usecols=[0, 3])
        return dict(zip(csv[0], csv[3]))

def get_file_list(path):
    return list(path.glob('*.hdf5'))

def split_dataset(files, train_percentage):
    n = len(files)
    indexes = np.arange(n)
    shuffled_indexes = np.random.permutation(indexes)

    percentages = np.arange(n, dtype=float) / n
    train_indexes = shuffled_indexes[percentages <= train_percentage]
    test_indexes = shuffled_indexes[percentages > train_percentage]
    train_files = [files[i] for i in train_indexes]
    test_files = [files[i] for i in test_indexes]

    return train_files, test_files


def generate_tfrecord(files, output_file):
    pdb_labels = get_pdb_labels('PDBBind')
    with tf.python_io.TFRecordWriter(output_file) as writer:
        for file in files:
            pdb_code = file.stem[:4]
            with h5py.File(str(file)) as hdf5_file:
                datapoint = np.array(hdf5_file['grid'], dtype=np.float32)
            X = datapoint.flatten()
            y = np.array([pdb_labels[pdb_code]])
            example = tf.train.Example(features=tf.train.Features(feature={
                'X': tf.train.Feature(float_list=tf.train.FloatList(value=X)),
                'y': tf.train.Feature(float_list=tf.train.FloatList(value=y))
            }))
            writer.write(example.SerializeToString())
            file.unlink()
    print('Done writing {}'.format(output_file))

def chunk_by_size(files, recommended_tf_size=float(100*(2**20))):
    average_size = sum(map(lambda x: x.stat().st_size, files)) / float(len(files))
    files_per_chunk = np.ceil(recommended_tf_size / average_size)
    chunks = int(np.ceil(float(len(files)) / files_per_chunk))
    return np.array_split(np.array(files), chunks)

def hdf5_to_tfrecords(folder, split):
    files = get_file_list(folder)
    train_files, test_files = split_dataset(files, split)
    train_chunks = chunk_by_size(train_files)
    test_chunks = chunk_by_size(test_files)
    train_folder = folder / 'train'
    test_folder = folder / 'test'
    shutil.rmtree(train_folder, ignore_errors=True)
    shutil.rmtree(test_folder, ignore_errors=True)
    train_folder.mkdir(parents=True, exist_ok=True)
    test_folder.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(train_chunks):
        chunk_output = train_folder / 'train_{}.tfrecords'.format(i)
        generate_tfrecord(chunk, str(chunk_output))
        print('Generated {}'.format(chunk_output))
    for i, chunk in enumerate(test_chunks):
        chunk_output = test_folder / 'test_{}.tfrecords'.format(i)
        generate_tfrecord(chunk, str(chunk_output))
        print('Generated {}'.format(chunk_output))

def split_dataset_refined_core(files):
    pdb_core = get_pdb_core_set_list()
    train_files = [file for file in files if file.stem[:4] not in pdb_core]
    test_files = [file for file in files if file.stem[:4] in pdb_core]
    return train_files, test_files

def hdf5_to_tfrecords_refined_core(folder):
    files = get_file_list(folder)
    train_files, test_files = split_dataset_refined_core(files)
    train_chunks = chunk_by_size(train_files)
    test_chunks = chunk_by_size(test_files)
    train_folder = folder / 'train'
    test_folder = folder / 'test'
    shutil.rmtree(train_folder, ignore_errors=True)
    shutil.rmtree(test_folder, ignore_errors=True)
    train_folder.mkdir(parents=True, exist_ok=True)
    test_folder.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(train_chunks):
        chunk_output = train_folder / 'train_{}.tfrecords'.format(i)
        generate_tfrecord(chunk, str(chunk_output))
        print('Generated {}'.format(chunk_output))
    for i, chunk in enumerate(test_chunks):
        chunk_output = test_folder / 'test_{}.tfrecords'.format(i)
        generate_tfrecord(chunk, str(chunk_output))
        print('Generated {}'.format(chunk_output))


import argparse
parser = argparse.ArgumentParser(usage='%(prog)s A ? [B | C]')
parser.add_argument('-f', '--folder', dest='folder', default=str(Path.home()))
parser.add_argument('-c', '--core', dest='core', action='store_true')
parser.add_argument('-s', '--split', dest='split', type=float)

if __name__ == "__main__":
    args = parser.parse_args()
    folder = args.folder
    split = args.split
    folder = Path(folder)
    if args.core:
        hdf5_to_tfrecords_refined_core(folder)
    else:
        hdf5_to_tfrecords(folder, split)
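The arithmetic behind chunk_by_size can be checked by hand: the average HDF5 file size determines how many files fit into one record file of roughly 100 MiB, and the files are then spread as evenly as possible over the resulting number of chunks. The sketch below reproduces this with the standard library only; the function name plan_chunks and the example sizes are assumptions, and the even split is written out explicitly to mimic what numpy.array_split does in the listing.

```python
import math


def plan_chunks(file_sizes, target_bytes=100 * 2**20):
    """Return the number of files assigned to each chunk.

    Hypothetical helper: file_sizes are sizes in bytes, target_bytes
    is the recommended size of one record file (100 MiB by default).
    """
    if not file_sizes:
        return []
    average = sum(file_sizes) / len(file_sizes)
    files_per_chunk = math.ceil(target_bytes / average)
    n_chunks = math.ceil(len(file_sizes) / files_per_chunk)
    # Even split: the first `extra` chunks receive one additional file,
    # which is the distribution numpy.array_split produces.
    base, extra = divmod(len(file_sizes), n_chunks)
    return [base + 1 if i < extra else base for i in range(n_chunks)]


# Ten made-up files of 30 MiB each: four files fit under 100 MiB after
# rounding up, so three chunks are needed.
sizes = [30 * 2**20] * 10
chunk_lengths = plan_chunks(sizes)
```

Because the number of files per chunk is rounded up, a chunk can exceed the target size; the target acts as a guideline rather than a hard limit.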

C.5 PDBQT generation

Listing C.22: thesis/appendix/make_pdbqt.sh

#!/bin/bash
root= # Set root path for complex folders
pythonsh="${MGLTOOLS}/pythonsh"
find $root -type d -maxdepth 1 -mindepth 1 | sort | parallel -n 1 $pythonsh preprocess_vina.py {}

Listing C.23: thesis/appendix/preprocess_vina.py

# Code extracted from MGLTools. AutoDock 4 is distributed under GNU GPL license. http://autodock.scripps.edu/
import os
from MolKit import Read
import MolKit.molecule
import MolKit.protein
from AutoDockTools.MoleculePreparation import AD4ReceptorPreparation, AD4LigandPreparation
import sys
import getopt


def preprocess_receptor(receptor_filename, outputfilename):
    repairs = ''
    charges_to_add = 'gasteiger'
    preserve_charge_types = None
    cleanup = ""
    mode = "automatic"
    delete_single_nonstd_residues = False
    dictionary = None

    mols = Read(receptor_filename)
    mol = mols[0]
    preserved = {}
    if charges_to_add is not None and preserve_charge_types is not None:
        preserved_types = preserve_charge_types.split(',')
        for t in preserved_types:
            if not len(t): continue
            ats = mol.allAtoms.get(lambda x: x.autodock_element == t)
            for a in ats:
                if a.chargeSet is not None:
                    preserved[a] = [a.chargeSet, a.charge]

    if len(mols) > 1:
        ctr = 1
        for m in mols[1:]:
            ctr += 1
            if len(m.allAtoms) > len(mol.allAtoms):
                mol = m
    mol.buildBondsByDistance()

    RPO = AD4ReceptorPreparation(mol, mode, repairs, charges_to_add,
                                 cleanup, outputfilename=outputfilename,
                                 preserved=preserved,
                                 delete_single_nonstd_residues=delete_single_nonstd_residues,
                                 dict=dictionary)

    if charges_to_add is not None:
        for atom, chargeList in preserved.items():
            atom.charges[chargeList[0]] = chargeList[1]
            atom.chargeSet = chargeList[0]

def preprocess_ligand(ligand_filename, outputfilename):
    verbose = None
    repairs = ""  # "hydrogens bonds"
    charges_to_add = 'gasteiger'
    preserve_charge_types = ''
    cleanup = ""
    allowed_bonds = "backbone"
    root = 'auto'
    check_for_fragments = True
    bonds_to_inactivate = ""
    inactivate_all_torsions = True
    attach_nonbonded_fragments = True
    attach_singletons = True
    mode = "automatic"
    dict = None

    mols = Read(ligand_filename)
    if verbose: print 'read ', ligand_filename
    mol = mols[0]
    if len(mols) > 1:
        ctr = 1
        for m in mols[1:]:
            ctr += 1
            if len(m.allAtoms) > len(mol.allAtoms):
                mol = m
    coord_dict = {}
    for a in mol.allAtoms: coord_dict[a] = a.coords

    mol.buildBondsByDistance()
    if charges_to_add is not None:
        preserved = {}
        preserved_types = preserve_charge_types.split(',')
        for t in preserved_types:
            if not len(t): continue
            ats = mol.allAtoms.get(lambda x: x.autodock_element == t)
            for a in ats:
                if a.chargeSet is not None:
                    preserved[a] = [a.chargeSet, a.charge]

    LPO = AD4LigandPreparation(mol, mode, repairs, charges_to_add,
                               cleanup, allowed_bonds, root,
                               outputfilename=outputfilename,
                               dict=dict, check_for_fragments=check_for_fragments,
                               bonds_to_inactivate=bonds_to_inactivate,
                               inactivate_all_torsions=inactivate_all_torsions,
                               attach_nonbonded_fragments=attach_nonbonded_fragments,
                               attach_singletons=attach_singletons)
    if charges_to_add is not None:
        for atom, chargeList in preserved.items():
            atom.charges[chargeList[0]] = chargeList[1]
            atom.chargeSet = chargeList[0]
    bad_list = []
    for a in mol.allAtoms:
        if a in coord_dict.keys() and a.coords != coord_dict[a]:
            bad_list.append(a)
    if len(bad_list):
        print len(bad_list), ' atom coordinates changed!'
        for a in bad_list:
            print a.name, ":", coord_dict[a], ' -> ', a.coords
    else:
        if verbose: print "No change in atomic coordinates"
    if mol.returnCode != 0:
        sys.stderr.write(mol.returnMsg + "\n")


from fnmatch import fnmatch

def get_files(folder):
    return ([os.path.join(folder, x) for x in os.listdir(folder) if fnmatch(x, '*protein_0*.pdb') and not x.endswith('autopsf.pdb')],
            [os.path.join(folder, x) for x in os.listdir(folder) if fnmatch(x, '*ligand_0*.mol2') and not x.endswith('autopsf.mol2')])

def process_folder(folder):
    receptors, ligands = get_files(folder)
    receptors = [receptor for receptor in receptors if not os.path.exists(receptor + 'qt')]
    ligands = [ligand for ligand in ligands if not os.path.exists(ligand.replace('.mol2', '.pdbqt'))]
    for receptor in receptors:
        try:
            preprocess_receptor(receptor, receptor.replace('.pdb', '.pdbqt'))
        except Exception, e:
            print 'Protein ', receptor
            raise e
    for ligand in ligands:
        try:
            preprocess_ligand(ligand, ligand.replace('.mol2', '.pdbqt'))
        except Exception, e:
            print 'Ligand ', ligand
            raise e

if __name__ == "__main__":
    process_folder(sys.argv[1])

C.6 Molecular Dynamics simulation

Listing C.24: thesis/appendix/make_protein_psf.tcl

package require autopsf

proc gen_psf {pdb_path} {
    set molID [mol load pdb $pdb_path]
    autopsf -mol $molID -protein -regen
}

gen_psf [lindex $argv 0]
exit

Listing C.25: thesis/appendix/molecular_dynamics.py

import argparse
import os
import subprocess
import queue
import threading
from pathlib import Path
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from mdtraj.reporters import DCDReporter

def generate_snapshots(pdb_path, psf_path, dcd_path):
    psf = CharmmPsfFile(psf_path)
    pdb = PDBFile(pdb_path)
    params = CharmmParameterSet(
        'toppar/stream/misc/toppar_amines.str',
        'toppar/stream/misc/toppar_dum_noble_gases.rtf',
        'toppar/stream/misc/toppar_hbond.str',
        'toppar/toppar_water_ions.str',
        'toppar/stream/prot/toppar_all36_prot_model.rtf',
        'toppar/top_all22_metals.rtf',
        'toppar/par_all22_metals.inp',
        'toppar/top_all36_carb.rtf',
        'toppar/par_all36_carb.prm',
        'toppar/top_all36_cgenff.rtf',
        'toppar/par_all36_cgenff.prm',
        'toppar/top_all35_ethers.rtf',
        'toppar/par_all35_ethers.prm',
        'toppar/top_all36_lipid.rtf',
        'toppar/par_all36_lipid.prm',
        'toppar/top_all36_na.rtf',
        'toppar/par_all36_na.prm',
        'toppar/top_all36_prot.rtf',
        'toppar/par_all36_prot.prm',
        'toppar/top_all27_prot_na.rtf',
        'toppar/par_all22_prot.prm',
        'toppar/stream/prot/toppar_all36_prot_aldehydes.rtf',
        'toppar/stream/prot/toppar_all36_prot_pyridines.rtf',
        'toppar/stream/prot/toppar_all36_prot_fluoro_alkanes.rtf',
        # 'toppar/toph19.inp',
        # 'toppar/param19.inp',
        'toppar/par_hbond.inp'
    )
    modeller = Modeller(pdb.topology, pdb.positions)
    modeller.addHydrogens()
    system = psf.createSystem(params, nonbondedMethod=NoCutoff,
                              nonbondedCutoff=1*nanometer, constraints=HBonds,
                              implicitSolvent=HCT,
                              implicitSolventSaltConc=0.1*moles/liter)
    integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 1*femtoseconds)
    simulation = Simulation(psf.topology, system, integrator)
    simulation.context.setPositions(pdb.positions)
    simulation.minimizeEnergy(maxIterations=2000)
    simulation.reporters.append(DCDReporter(dcd_path, 20000))
    simulation.step(200000)
    simulation.saveCheckpoint(dcd_path + ".chk")

def spawn(pdb_code, gpu):
    subprocess.run(["python3", "molecular_dynamics.py", str(gpu), "--run", pdb_code])

def worker(gpu):
    while True:
        pdb = q.get()
        if pdb is None:
            break
        spawn(pdb, gpu)
        q.task_done()


def set_queue(gpus, pdbs):
    threads = []
    for gpu in gpus:
        t = threading.Thread(target=worker, args=(gpu,))
        t.start()
        threads.append(t)
    for pdb in pdbs:
        q.put(pdb)
    q.join()
    for i in gpus:
        q.put(None)
    for t in threads:
        t.join()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('gpu')
    parser.add_argument('--file')
    parser.add_argument('--run')
    args = parser.parse_args()

    gpu = args.gpu
    if args.run:
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu
        pdb_code = args.run
        pdb_path = str(Path('./pdb_autopsf') / (pdb_code + '.pdb'))
        psf_path = str(Path('./psf_autopsf') / (pdb_code + '.psf'))
        dcd_path = str(Path('./dcd') / (pdb_code + '.dcd'))
        if not Path(dcd_path).exists():
            print('@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@', pdb_code)
            generate_snapshots(pdb_path, psf_path, dcd_path)
    else:
        q = queue.Queue()
        gpus = gpu.split(',')
        pdbs = []
        with open(args.file) as f:
            files = f.read().split()
        for pdb_code in files:
            pdb_path = (Path('./pdb_autopsf') / (pdb_code + '_protein_autopsf.pdb'))
            dcd_path = Path('./dcd') / (pdb_path.stem + '.dcd')
            if not dcd_path.exists():
                pdbs.append(pdb_path.stem)
        set_queue(gpus, pdbs)
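The scheduling in set_queue follows a common pattern: one worker thread per GPU pulls structure names from a shared queue until it receives a None sentinel. The sketch below isolates that pattern so it can be run without OpenMM or a GPU. The function run_on_devices, the work callback, and the example codes are hypothetical stand-ins for spawn and real PDB codes; the sketch also records each outcome and catches exceptions inside the worker, which the listing does not do.

```python
import queue
import threading


def run_on_devices(devices, jobs, work):
    """Run work(job, device) for every job, one worker thread per device.

    Hypothetical helper illustrating the sentinel-terminated queue;
    `work` stands in for launching one simulation on one device.
    """
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker(device):
        while True:
            job = q.get()
            if job is None:  # sentinel: no more work for this thread
                break
            try:
                outcome = work(job, device)
            except Exception as error:  # keep the worker alive on failure
                outcome = error
            with lock:
                results.append((job, device, outcome))

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for job in jobs:
        q.put(job)
    for _ in threads:
        q.put(None)  # one sentinel per worker, queued after all jobs
    for t in threads:
        t.join()
    return results


# Made-up structure codes and device identifiers.
done = run_on_devices(
    ['0', '1'],
    ['1abc', '2xyz', '3def'],
    lambda job, device: f'{job}@gpu{device}',
)
```

Since the queue is first-in, first-out, every job is taken before any sentinel, so each device stays busy until the job list is exhausted.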


Declaration of originality

The signed declaration of originality is a component of every semester paper, Bachelor's thesis, Master's thesis and any other degree paper undertaken during the course of studies, including the respective electronic versions. Lecturers may also require a declaration of originality for other written papers compiled for their courses.

I hereby confirm that I am the sole author of the written work here enclosed and that I have compiled it in my own words. Parts excepted are corrections of form and content by the supervisor.

Title of work (in block letters): NEURAL NETWORKS FOR IMPROVING DRUG DISCOVERY EFFICIENCY

Authored by (in block letters): For papers written by groups the names of all authors are required.

Name(s): HASSAN HARRIROU
First name(s): HUSSEIN

With my signature I confirm that
− I have committed none of the forms of plagiarism described in the 'Citation etiquette' information sheet.
− I have documented all methods, data and processes truthfully.
− I have not manipulated any data.
− I have mentioned all persons who were significant facilitators of the work.

I am aware that the work may be screened electronically for plagiarism.

Place, date: ZÜRICH, 31.03.2019

Signature(s): For papers written by groups the names of all authors are required. Their signatures collectively guarantee the entire content of the written paper.
