




Generative Models for Graph-Based Protein Design
John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola
{ingraham, vgarg, regina, tommi}@csail.mit.edu, MIT CSAIL, Cambridge MA, USA

Opportunities: de novo therapeutics, catalysts, and materials

Challenges: Despite many successes from conventional methods such as Rosetta, first designs often fail (unreliability), outcomes are sensitive to methodology (non-robustness), and design throughput is slow.

… could we learn to generate designs directly?

MSGIAVSDDCVQKFNELKLGHQ

Structure

Perplexity (per amino acid)

Learning protein design, directly

Result: Comparison of features and architecture

Result: Structure-conditioned language models can generalize to unseen 3D structures

Protein folding: Sequence → Structure

Protein design ("inverse folding"): Structure → Sequence

Approach: Graph-based sequence generation

Local attention builds up context for structure (and sequence)

Conclusion: Deep generative models can learn to design proteins directly from structure

Graph features represent molecular geometry

Edge features are relative transformations between frames (SE(3)-invariant). For residues i and j with positions x_i, x_j and local frames O_i, O_j (sketched in code below):

Translation: O_i^T (x_j - x_i)
Rotation: O_i^T O_j, encoded as a quaternion
Distance: ||x_j - x_i||, encoded with radial basis functions
Direction: O_i^T (x_j - x_i) / ||x_j - x_i||, a unit vector
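A minimal numpy sketch of these relative-frame features for one residue pair, assuming each residue i carries a position x_i and a 3x3 local frame O_i built from its backbone atoms; the RBF centers and width below are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def edge_features(x_i, O_i, x_j, O_j,
                  rbf_centers=np.linspace(0.0, 20.0, 16), rbf_sigma=1.5):
    """SE(3)-invariant relative geometry between residue frames (x_i, O_i) and (x_j, O_j)."""
    d_vec = x_j - x_i
    dist = np.linalg.norm(d_vec)

    # Distance, lifted with radial basis functions (centers/width are illustrative).
    rbf = np.exp(-((dist - rbf_centers) ** 2) / (2 * rbf_sigma ** 2))

    # Direction to j expressed in i's local frame: a unit vector.
    direction = O_i.T @ (d_vec / (dist + 1e-8))

    # Relative rotation O_i^T O_j, encoded as a unit quaternion (w, x, y, z).
    # (Simple branch of the conversion; assumes the relative rotation is not a half-turn.)
    R = O_i.T @ O_j
    w = np.sqrt(max(0.0, 1.0 + np.trace(R))) / 2.0
    quat = np.array([w,
                     (R[2, 1] - R[1, 2]) / (4.0 * w + 1e-8),
                     (R[0, 2] - R[2, 0]) / (4.0 * w + 1e-8),
                     (R[1, 0] - R[0, 1]) / (4.0 * w + 1e-8)])

    return np.concatenate([rbf, direction, quat])
```

Rotating and translating the whole structure leaves these features unchanged, which is the invariance referred to above.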


Table 1: Null perplexities

Null model            Perplexity   Conditioned on
Uniform               20.00        -
Natural frequencies   17.83        Random position in a natural protein
Pfam HMM profiles     11.64        Specific position in a specific protein family
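These null perplexities are just exponentiated entropies of fixed letter distributions. A quick sketch that reproduces the flavor of the numbers; the frequency values below are approximate, illustrative amino-acid frequencies, not the exact ones behind the table.

```python
import numpy as np

def perplexity(counts):
    """Perplexity of a fixed distribution = exp(entropy)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(np.exp(-np.sum(p * np.log(p))))

print(perplexity(np.ones(20)))    # uniform over 20 amino acids -> 20.0

# Approximate natural amino-acid frequencies (%, illustrative) -> roughly 17-18.
freqs = [8.3, 5.7, 4.1, 5.5, 1.4, 3.9, 6.2, 7.3, 2.2, 5.3,
         9.1, 5.9, 2.4, 4.0, 4.7, 6.6, 5.4, 6.9, 1.1, 3.0]
print(perplexity(freqs))
```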

Each layer of the encoder implements a multi-head self-attention component, where head ℓ ∈ [L] can attend to a separate subspace of the embeddings via learned query, key, and value transformations (Vaswani et al., 2017). The queries are derived from the current embedding at node i, while the keys and values come from the relational information r_ij = (h_j, e_ij) at adjacent nodes j ∈ N(i, k). Specifically, W_q^(ℓ) maps h_i to query embeddings q_i^(ℓ), W_z^(ℓ) maps pairs r_ij to key embeddings z_ij^(ℓ) for j ∈ N(i, k), and W_v^(ℓ) maps the same pairs r_ij to value embeddings v_ij^(ℓ), for each i ∈ [N], ℓ ∈ [L]. Decoupling the mappings for keys and values allows each to depend on different subspaces of the representation.

We compute the attention a_ij^(ℓ) between query q_i^(ℓ) and key z_ij^(ℓ) as a function of their scaled inner product:

    a_ij^(ℓ) = exp(m_ij^(ℓ)) / Σ_{j' ∈ N(i,k)} exp(m_ij'^(ℓ)),   where   m_ij^(ℓ) = q_i^(ℓ) · z_ij^(ℓ) / √d.

The results of each attention head ℓ are collected as the weighted sum h_i^(ℓ) = Σ_{j ∈ N(i,k)} a_ij^(ℓ) v_ij^(ℓ) and then concatenated and transformed to give the update Δh_i = W_o Concat(h_i^(1), …, h_i^(L)).

We update the embeddings with this residual and alternate between these self-attention layers and position-wise feedforward layers as in the original Transformer (Vaswani et al., 2017). We stack multiple layers atop each other, and thereby obtain continually refined embeddings as we traverse the layers bottom up. The encoder yields the embeddings produced by the topmost layer as its output.
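A compact PyTorch sketch of one such attention layer over the k-NN graph, with queries from h_i and keys/values from the tuples (h_j, e_ij). The dimensions, head count, and the omission of the position-wise feedforward sublayer are simplifications for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NeighborAttention(nn.Module):
    """Multi-head self-attention where node i attends only to its k nearest neighbors."""
    def __init__(self, d_model=128, d_edge=128, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)           # queries from node embedding h_i
        self.W_z = nn.Linear(d_model + d_edge, d_model)  # keys from r_ij = (h_j, e_ij)
        self.W_v = nn.Linear(d_model + d_edge, d_model)  # values from the same pairs
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, h, e, nbr):
        # h: [N, d_model] nodes, e: [N, k, d_edge] edge features, nbr: [N, k] neighbor indices.
        N, k = nbr.shape
        r = torch.cat([h[nbr], e], dim=-1)                                      # [N, k, d_model+d_edge]
        q = self.W_q(h).view(N, self.n_heads, 1, self.d_head)
        z = self.W_z(r).view(N, k, self.n_heads, self.d_head).transpose(1, 2)   # [N, H, k, d_head]
        v = self.W_v(r).view(N, k, self.n_heads, self.d_head).transpose(1, 2)
        a = torch.softmax((q * z).sum(-1) / self.d_head ** 0.5, dim=-1)         # [N, H, k]
        out = (a.unsqueeze(-1) * v).sum(dim=2)                                  # [N, H, d_head]
        return h + self.W_o(out.reshape(N, -1))                                 # residual update
```

Stacking a few of these layers, alternating with position-wise feedforward layers as described above, gives each residue a progressively larger receptive field over its spatial neighborhood.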

Decoder. Our decoder module has the same structure as the encoder but with augmented relational information r_ij that allows access to the preceding sequence elements s_<i in a causally consistent manner. Whereas the keys and values of the encoder are based on the relational information r_ij = (h_j, e_ij), the decoder can additionally access sequence elements s_j as

    r_ij^(dec) = (h_j^(dec), e_ij, g(s_j))   if i > j
    r_ij^(dec) = (h_j^(enc), e_ij, 0)        if i ≤ j.

Here h_j^(dec) is the embedding of node j in the current layer of the decoder, h_j^(enc) is the embedding of node j in the final layer of the encoder, and g(s_j) is a sequence embedding of amino acid s_j at node j. This concatenation and masking structure ensures that sequence information only flows to position i from positions j < i, but still allows position i to attend to subsequent structural information.
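A sketch of how the masked relational tuples r_ij^(dec) can be assembled; the tensor shapes and names are assumptions for illustration, not the paper's code.

```python
import torch

def decoder_relations(h_dec, h_enc, e, s_emb, nbr):
    """Build r_ij^(dec): (h_j^(dec), e_ij, g(s_j)) when j < i, else (h_j^(enc), e_ij, 0).

    h_dec, h_enc: [N, d] decoder / encoder node embeddings
    e:            [N, k, d_e] edge features to the k neighbors
    s_emb:        [N, d_s] sequence embeddings g(s_j)
    nbr:          [N, k] neighbor indices
    """
    N, k = nbr.shape
    i_idx = torch.arange(N).unsqueeze(-1)            # [N, 1], the receiving position i
    past = (nbr < i_idx).unsqueeze(-1).float()       # 1 where j < i (causally visible)

    h_j = past * h_dec[nbr] + (1.0 - past) * h_enc[nbr]  # decoder state for the past, encoder otherwise
    s_j = past * s_emb[nbr]                              # zero out sequence info from j >= i
    return torch.cat([h_j, e, s_j], dim=-1)              # [N, k, d + d_e + d_s]
```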

We stack three layers of self-attention and position-wise feedforward modules for the encoder and decoder, with a hidden dimension of 128 throughout the experiments.⁵

2.3 TRAINING

Dataset. To evaluate the ability of the models to generalize across different protein folds, we collected a dataset based on the CATH hierarchical classification of protein structure (Orengo et al.,

⁵ Except for the decoder-only language model experiment, which used a hidden dimension of 256.


Multi-head attention as in the Transformer [Vaswani et al 2017, Shaw et al 2018], with node embeddings h_i as the queries and relational tuples r_ij as the keys and values.

Encoder attends to node & edge tuples:

    r_ij^(enc) = (h_j^(enc), e_ij)

Decoder attends to node, edge, sequence triples with masking:

    r_ij^(dec) = (h_j^(dec), e_ij, g(s_j))   if i > j
    r_ij^(dec) = (h_j^(enc), e_ij, 0)        if i ≤ j

Structure and sequence

M S G I A V S

Structure Encoder Sequence Decoder (autoregressive)

Structure
Node (amino acid)

Backbone

Information flow

Structure G

Self-attention

Position-wise Feedforward

Node embeddings
Edge embeddings

Causal Self-attention

Position-wise Feedforward

Sequence s

Encoder

Decoder

The protein sequence design problem

Given a 3D protein structure x, what sequence(s) s will fold into it?

Input: 3D backbone
Output: Sequence of amino acids along the backbone

Cartoon  Atoms

p(s|x) = ∏_i p(s_i | x, s_<i)
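Given this factorization, the "perplexity per amino acid" used throughout the results is the exponentiated mean negative log-likelihood of these conditionals. A minimal sketch, assuming the model exposes per-position log-probabilities:

```python
import numpy as np

def per_residue_perplexity(log_probs, seq):
    """log_probs: [L, 20] array of log p(s_i = a | x, s_<i); seq: [L] integer-encoded amino acids."""
    nll = -log_probs[np.arange(len(seq)), seq]   # negative log-likelihood at each position
    return float(np.exp(nll.mean()))             # exp of the per-residue average
```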

Table 2: Per-residue perplexities for protein language modeling (lower is better). The protein chains have been cluster-split by CATH topology, such that the test set includes only unseen 3D folds. While a structure-conditioned language model can generalize in this structure-split setting, unconditional language models struggle.

Test set                         Short   Single chain   All
Structure-conditioned models
  Structured Transformer (ours)   8.54    9.03           6.85
  SPIN2 [8]                      12.11   12.61           -
Language models
  LSTM (h = 128)                 16.06   16.38          17.13
  LSTM (h = 256)                 16.08   16.37          17.12
  LSTM (h = 512)                 15.98   16.38          17.13
Test set size                       94     103           1120

3D structure [43], meaning that sequence similarity need not necessarily be high. At the same time, single mutations may cause a protein to break or misfold, meaning that high sequence similarity isn't sufficient for a correct design. To deal with this, we will focus on three kinds of evaluation: (i) likelihood-based, where we test the ability of the generative model to give high likelihood to held out sequences, (ii) native sequence recovery, where we evaluate generated sequences vs the native sequences of templates, and (iii) experimental comparison, where we compare the likelihoods of the model to high-throughput data from a de novo protein design experiment.

We find that our model is able to attain considerably improved statistical performance in its likelihoods while simultaneously providing more accurate and efficient sequence recovery.

4.1 Statistical comparison to likelihood-based models

Protein perplexities. What kind of perplexities might be useful? To provide context, we first present perplexities for some simple models of protein sequences in Table 1. The amino acid alphabet and its natural frequencies upper-bound perplexity at 20 and ∼17.8, respectively. Random protein sequences under these null models are unlikely to be functional without further selection [44]. First-order profiles of protein sequences such as those from the Pfam database [45], however, are widely used for protein engineering. We found the average perplexity per letter of profiles in Pfam 32 (ignoring alignment uncertainty) to be ∼11.6. This suggests that even models with perplexities as high as ∼11 have the potential to be useful for the space of functional protein sequences.

The importance of structure. We found that there was a significant gap between unconditional language models of protein sequences and models conditioned on structure. Remarkably, for a range of structure-independent language models, the typical test perplexities are ∼16-17 (Table 2), which were barely better than null letter frequencies (Table 1). We emphasize that the RNNs were not broken and could still learn the training set in these capacity ranges. All structure-based models had (unsurprisingly) considerably lower perplexities. In particular, our Structured Transformer model attained a perplexity of ∼7 on the full test set. It seems that protein language models trained on one subset of 3D folds (in our cluster-splitting procedure) generalize poorly to predict the sequences

Table 3: Ablation of graph features and model components. Test perplexities (lower is better).

Node features   Edge features              Aggregation   Short   Single chain   All
Rigid backbone
  Dihedrals     Distances, Orientations    Attention      8.54    9.03           6.85
  Dihedrals     Distances, Orientations    PairMLP        8.33    8.86           6.55
  Cα angles     Distances, Orientations    Attention      9.16    9.37           7.83
  Dihedrals     Distances                  Attention      9.11    9.63           7.87
Flexible backbone
  Cα angles     Contacts, Hydrogen bonds   Attention     11.71   11.81          11.51



Method               Recovery (%)   Speed (AA/s), CPU   Speed (AA/s), GPU
Rosetta 3.10 fixbb   17.9           4.88 × 10^-1        N/A
Ours (T = 0.1)       27.6           2.22 × 10^2         1.04 × 10^4
(a) Single chain test set (103 proteins)

Method             Recovery (%)
Rosetta, fixbb 1   33.1
Rosetta, fixbb 2   38.4
Ours (T = 0.1)     39.2
(b) Ollikainen benchmark (40 proteins)

Table 4: Improved reliability and speed compared to Rosetta. (a) On the 'single chain' test set, our model more accurately recovers native sequences than Rosetta fixbb with greater speed (CPU: single core of Intel Xeon Gold 5115, GPU: NVIDIA RTX 2080). This set includes NMR-based structures for which Rosetta is known to not be robust [46]. (b) Our model also performs favorably on a prior benchmark of 40 proteins. All results reported as median of average over 100 designs.
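For reference, "recovery" here is the fraction of designed positions that match the native amino acid, and the reported statistic is the median over proteins of the average over designs. A small sketch of that computation (the function names are just for illustration):

```python
import numpy as np

def recovery(design, native):
    """Fraction of positions where a designed sequence matches the native one."""
    return float(np.mean([a == b for a, b in zip(design, native)]))

def benchmark_recovery(designs_per_protein, natives):
    """Median over proteins of the mean recovery over the designs for each protein."""
    per_protein = [np.mean([recovery(d, nat) for d in designs])
                   for designs, nat in zip(designs_per_protein, natives)]
    return float(np.median(per_protein))
```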

of unseen folds. We believe this possibility might be important to consider when training protein language models for protein engineering and design.

Improvement over deep profile-based methods. We also compared to a recent method, SPIN2, that predicts, using deep neural networks, protein sequence profiles given protein structures [8]. Since SPIN2 is computationally intensive (minutes per protein for small proteins) and was trained on complete proteins rather than chains, we evaluated it on two subsets of the full test set: a 'Short' subset containing chains up to length 100 and a 'Single chain' subset containing only those models where the single chain accounted for the entire protein record in the Protein Data Bank. Both subsets discarded any chains with structural gaps (chain breaks). We found that our Structured Transformer model significantly improved upon the perplexities of SPIN2 (Table 2).

Graph representations and attention mechanisms. The graph-based formulation of protein design can accommodate very different formulations of the problem depending on how structure is represented by a graph. We tested different approaches for representing the protein, including both more 'rigid' design with precise geometric details and 'flexible' topological design based on spatial contacts and hydrogen bonding (Table 3). For the best perplexities, we found that using local orientation information was indeed important above simple distance measures. At the same time, even the topological features were sufficient to obtain better perplexities than SPIN2 (Table 2), which uses precise atomic details.

In addition to varying the graph features, we also experimented with an alternative aggregation function from message passing neural networks [36].⁵ We found that a simple aggregation function Δh_i = Σ_j MLP(h_i, h_j, e_ij) led to the best performance of all models, where MLP(·) is a two-layer perceptron that preserves the hidden dimension of the model. We speculate that this is due to potential overfitting by the attention mechanism. Although this suggests room for future improvements, we use multi-head self-attention throughout the remaining experiments.
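A sketch of that simpler "PairMLP" aggregation in PyTorch, assuming the two-layer MLP sees the receiving node, the neighbor, and the edge feature; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PairMLPAggregation(nn.Module):
    """Message-passing update: delta h_i = sum_j MLP(h_i, h_j, e_ij) over the k-NN neighbors."""
    def __init__(self, d_model=128, d_edge=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model + d_edge, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model))  # two layers, preserving the hidden dimension

    def forward(self, h, e, nbr):
        # h: [N, d_model], e: [N, k, d_edge], nbr: [N, k] neighbor indices.
        N, k = nbr.shape
        h_i = h.unsqueeze(1).expand(N, k, -1)                 # receiving node, repeated per neighbor
        msg = self.mlp(torch.cat([h_i, h[nbr], e], dim=-1))   # one message per edge
        return h + msg.sum(dim=1)                             # residual update with summed messages
```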

4.2 Benchmarking protein redesign

Decoding strategies. Generating protein sequence designs requires a sampling scheme for drawing high-likelihood sequences from the model. While beam search or top-k sampling [47] are commonly used heuristics for decoding, we found that simple biased sampling from the temperature-adjusted distributions

    p^(T)(s|x) = ∏_i p(s_i|x, s_<i)^(1/T) / Σ_a p(a|x, s_<i)^(1/T)

was sufficient for obtaining sequences with higher likelihoods than native. We used a temperature of T = 0.1, selected from sequence recovery on validation. For conditional redesign of a subset of positions in a protein, we speculate that the likelihood calculation is sufficiently fast that MCMC-based approaches such as Gibbs sampling may be feasible.
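A sketch of this temperature-adjusted sampling; `decode_logits` is a hypothetical interface standing in for the structure-conditioned decoder.

```python
import torch

def sample_design(decode_logits, length, T=0.1):
    """Autoregressively sample a sequence from p^(T)(s|x).

    decode_logits(prefix, i) -> [20] unnormalized log p(s_i | x, s_<i)  (hypothetical interface)
    """
    seq = []
    for i in range(length):
        logits = decode_logits(seq, i)
        probs = torch.softmax(logits / T, dim=-1)   # sharpen each conditional by 1/T, then renormalize
        seq.append(int(torch.multinomial(probs, 1)))
    return seq
```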

⁵ We thank one of our reviewers for this suggestion.



Result: Improved speed and accuracy over conventional methods

Ollikainen et al. benchmark (40 structures; training re-split for zero topology overlap)

Most sequence dependencies are spatially local

Dense: O(N²) pairwise distances
Sparse: k-NN graph, O(Nk) edges, with distances and orientations (k = 30)

Spatial adjacency gives the relevant context for sequences
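A minimal numpy sketch of the sparse k-NN graph over residues, built from Cα coordinates (the array `ca` of shape [N, 3] is an assumed input), with k = 30 as on the poster.

```python
import numpy as np

def knn_graph(ca, k=30):
    """Indices of the k nearest residues (by C-alpha distance) for each position."""
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)        # O(N^2) to build, but the graph itself is O(N*k)
    np.fill_diagonal(dist, np.inf)              # exclude self-edges
    return np.argsort(dist, axis=-1)[:, :k]     # [N, k] neighbor indices
```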

Why so low? This set contains many NMR structures (rather than X-ray) for which conventional methods are not robust

Evolutionary evidence: sequence covariation predicts spatial contacts [Marks et al 2012]

Engineering evidence: Rosetta uses spatially local pairwise Markov Random Fields [Leaver-Fay et al 2011]

Our single chain test set (103 structures)

~400x speedup on one CPU core; ~20,000x speedup on a GPU

Atoms Cartoon

QETIEVEDEEEARRVAKELRKKGYEVKIERRGNKWHVHRT

Structure x → Attributed graph V(x), E(x) → Sequence s

1. Represent structure as graph

2. Language modeling decoder

References

1. O'Connell, James, et al. "SPIN2: Predicting sequence profiles from protein structures using deep neural networks." 2018.

2. Conchúir, Shane Ó., et al. "A web resource for standardized benchmark datasets, metrics, and Rosetta protocols for macromolecular modeling and design." 2015.

3. Leaver-Fay, Andrew, et al. "ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules." 2011.

4. Vaswani, Ashish, et al. "Attention is all you need." 2017.

5. Marks, Debora S., Thomas A. Hopf, and Chris Sander. "Protein structure prediction from sequence variation." 2012.

Implication: Structure the computation to focus on spatial interaction

Dataset creation

CATH database, 40% non-redundant set
→ Full chains, up to 500 AA
→ Split by topology
→ ~18,000 chains in train

p(s|x) = ∏_i p(s_i | x, s_<i)

Node features: Dihedral angles of backbone

Relations: Distance, Direction, Rotation

Point cloud with local frames
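A sketch of the dihedral node features: the standard torsion-angle computation from four consecutive backbone atoms, with each angle embedded as a sine/cosine pair (the bookkeeping for selecting the atoms of φ, ψ, ω is omitted).

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four consecutive backbone atom positions."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)   # normals of the two planes
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m1 @ n2, n1 @ n2)

def node_features(phi, psi, omega):
    """Embed a residue's backbone dihedrals on the unit circle."""
    angles = np.array([phi, psi, omega])
    return np.concatenate([np.cos(angles), np.sin(angles)])
```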

Why so low? Sequences in the test set are from different fold topologies.

Significant boost in statistical performance vs. the other neural method.

Simpler message-passing aggregation offers room for improvement (thanks, reviewer!)

Goal: Expressive but invariant descriptors