MACHINE LEARNING FOR PREDICTION OF POLY-SPECIFICITY FROM SEQUENCE Tushar … · 2018. 7. 17. ·...

MACHINE LEARNING FOR PREDICTION OF POLY-SPECIFICITY FROM SEQUENCE

Tushar Jain

June 27, 2018

LESSONS FROM CLINICAL ANTIBODIES

HIGH THROUGHPUT EXPERIMENTAL SURROGATES FOR PK

MACHINE LEARNING METHODOLOGY

PREDICTING POLY-SPECIFICITY

CONCLUSIONS

COPYRIGHT | © 2018 Adimab, LLC

PROPERTIES OF CLINICAL ANTIBODIES

• Twelve biophysical measurements cluster into related subgroups

• Poly-specificity - PSR, CSI, ACSINS, CIC

• Poly-specificity - BVP, ELISA

• Hydrophobicity - HIC, SMAC, SGAC100

• Stability - Titer, Tm

Jain et al., PNAS, 2017

SOLUBILIZED MEMBRANE PROTEINS AS A POLY-SPECIFICITY REAGENT (PSR)

[Figure: preparation of the PSR. Labels: Non-target Cell; Lyse the cells without detergent; Cell Extract; Cytosolic Proteins; Enriched Membranes; Membrane Proteins; Detergent; Solubilized Membrane Proteins (SMP); Random biotinylation]

Biotinylated SMPs can be used as a screening (one-off) or selection (batch) tool

Xu et al., PEDS, 2013

POLY-SPECIFICITY AS AN INDICATOR FOR POOR PK

Kelly et al., mAbs, 2015

Hotzel et al., mAbs, 2012

FCRN BINDING AS A PREDICTOR OF POOR PK

BRIAKINUMAB VS USTEKINUMAB

Schoch et al., PNAS, 2015

Jain et al., PNAS, 2017

FcRn retention time (RT) data from Kettenberger et al.

Kelly et al., mAbs, 2016

FcRn knockout mouse

BEHAVIOR IN SEVERAL ASSAYS CORRELATED WITH ACCELERATED CLEARANCE

Avery et al., mAbs, 2018

PREDICTION OF THE BOTTOM 25% OF MABS IN AN ASSAY USING ANOTHER MEASUREMENT

[Figure: areas under ROC curve. Panels: predict bottom 25% of PSR using other assay measurements; predict bottom 25% of other assays using PSR]

Jain et al., PNAS, 2017

FcRn and Heparin retention time (RT) data from Kettenberger et al.

PREDICTION OF THE BOTTOM 25% OF MABS IN AN ASSAY USING ANOTHER MEASUREMENT

[Figure: areas under ROC curve; ROC for predicting bottom 25% FcRn RT using PSR assay]

AUC: 0.85, N = 133
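An AUC of this kind can be read as the probability that a randomly chosen flagged antibody (bottom 25% in the target assay) scores worse in the predictor assay than a randomly chosen unflagged one. The following is a minimal sketch of that calculation, not the analysis code behind the slide. It assumes that higher values are worse in both assays; the function names and the toy values are hypothetical.

```python
def flag_worst_quartile(values):
    """Label the worst 25% of antibodies as 1 and the rest as 0.

    Assumption for this sketch: a higher assay value means worse behavior.
    """
    n_flag = max(1, len(values) // 4)
    # Indices ordered from worst (largest value) to best (smallest value).
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    flagged = set(order[:n_flag])
    return [1 if i in flagged else 0 for i in range(len(values))]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one flagged and one unflagged item")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values (made up): eight antibodies measured in two assays.
target_assay = [0.9, 0.1, 0.8, 0.2, 0.3, 0.25, 0.15, 0.05]
predictor_assay = [0.7, 0.2, 0.3, 0.1, 0.4, 0.15, 0.05, 0.12]

labels = flag_worst_quartile(target_assay)
auc = roc_auc(labels, predictor_assay)
print(labels, round(auc, 3))
```

The pairwise form is quadratic in the number of antibodies, which is fine for a panel of about a hundred molecules; a rank-based formula would be the usual choice for larger sets.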

MODELS FOR DEVELOPABILITY PREDICTION

INPUT ANTIBODY DATA

SEQUENCE

• Aligned antibody sequences

• Germline information

• CDR lengths, etc

• Amino-acid property scales

• Hydrophobicity

• Size, charge, etc

STRUCTURAL PROPERTIES

• Structural metrics important for developability assay under consideration

• Solvent-accessible surface-area (SASA)

• Residue contact probabilities

• Local flexibility

• Isoelectric point, etc

MACHINE LEARNING ALGORITHMS

• Logistic Regression with LASSO regularization

• Tree-based method: XGBoost

• Feed-forward neural networks

EXAMPLE OF PREDICTING SASA FROM SEQUENCE

[Figure: estimated fractional SASA (y-axis) vs. computed fractional SASA from PDBs (x-axis)]

RMSE values shown: 9.8%, 9%, 14.6%, 8.2%, 8.3%, 8.9%

Jain et al., Bioinformatics, 2017

Yang et al., mAbs, 2017

ENCODING ANTIBODY DATA FOR MACHINE LEARNING

[Figure: feature matrix with one row per antibody]

HC information:

• Aligned sequences (positions H1 to H113)

• CDR H1 SASA, CDR H2 SASA, CDR H3 SASA, each by amino-acid type (A R N D ... V W Y)

• CDR lengths (example rows: 10 17 14; 12 17 11)

• VHF (example rows: VH1; VH4)

+ LC information

Label: Desirable? (0 or 1)
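The encoding on this slide can be illustrated with a short sketch that turns one antibody into a flat numeric row. It assumes one-hot encoding of aligned positions with gaps mapped to all zeros, and a one-hot germline family; the amino-acid ordering, the function names, and the toy inputs are hypothetical choices for illustration, not details taken from the talk.

```python
# Assumption: 20 standard residues; the column order is an arbitrary choice.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot_aligned(seq):
    """One-hot encode an aligned sequence; gap positions ('-') stay all zero."""
    features = []
    for residue in seq:
        features.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return features


def one_hot_category(value, categories):
    """One-hot encode a categorical value such as a germline family."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode_antibody(aligned_vh, cdr_sasa, cdr_lengths, vh_family, vh_families):
    """Concatenate sequence, SASA, CDR-length, and germline features."""
    row = one_hot_aligned(aligned_vh)
    row.extend(float(v) for v in cdr_sasa)
    row.extend(float(n) for n in cdr_lengths)
    row.extend(one_hot_category(vh_family, vh_families))
    return row


# Toy input (made up): a 4-position aligned fragment with one gap.
vh_families = ["VH1", "VH3", "VH4"]
row = encode_antibody("AR-D", [0.1, 3.2, 0.4], [10, 17, 14], "VH1", vh_families)
print(len(row))
```

Light-chain features would be appended to the same row in the same way, and the 0/1 "Desirable?" label is kept separately as the training target.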

PREDICTION OF ANTIBODIES WITH POOR PSR SCORES

• Different machine learning methods perform comparably, though XGBoost is slightly better

• Simpler models using only aggregated SASA in CDRs by amino-acid type show reasonable performance when compared to models with full sequence information

• Logistic regression on the SASA models enables assessment of amino-acid propensities for poly-specificity
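The "aggregated SASA in CDRs by amino-acid type" features mentioned above can be sketched as a simple grouped sum. This is an illustrative assumption about how such features might be built; the function name, the tuple layout, and the toy SASA values are hypothetical.

```python
from collections import defaultdict


def aggregate_sasa(residues):
    """Sum per-residue SASA into (CDR, amino-acid type) buckets.

    `residues` is an iterable of (cdr_name, amino_acid, sasa) tuples.
    """
    totals = defaultdict(float)
    for cdr, amino_acid, sasa in residues:
        totals[(cdr, amino_acid)] += sasa
    return dict(totals)


# Toy per-residue values (made up) for two CDRs of one antibody.
residues = [
    ("H1", "Y", 40.0), ("H1", "S", 12.5), ("H1", "Y", 35.0),
    ("H3", "R", 80.0), ("H3", "W", 55.5), ("H3", "R", 20.0),
]
features = aggregate_sasa(residues)
print(features[("H3", "R")])
```

Because each feature is tied to one amino-acid type in one CDR, the coefficients of a linear model fitted on these features can be read directly as per-residue-type propensities.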

Experimental PSR data

• ~30,000 antibodies with ~13,000 distinct H3s

• Training and test splits done on the basis of H3s

Area under ROC curve, 10-fold cross-validation, PSR score > 0.1

Model                        XGBoost   Logistic Regression   Neural Network
Sequence                     0.74      0.74                  0.74
Sequence + AA SASA per CDR   0.77      0.74                  0.76
AA SASA per CDR              0.76      0.72                  0.72
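Splitting on the basis of H3s means that antibodies sharing an H3 are kept on the same side of every train/test split, so that performance is not inflated by near-duplicates. A minimal sketch of such a grouped fold assignment follows; the function names and toy H3 strings are hypothetical, and the deterministic round-robin assignment is a simplification (a real pipeline would typically shuffle the distinct H3s first).

```python
def h3_grouped_folds(h3_sequences, n_folds):
    """Assign each antibody to a fold so identical H3s never straddle folds."""
    distinct = sorted(set(h3_sequences))
    # Round-robin over distinct H3s; simplification for this sketch.
    fold_of_h3 = {h3: i % n_folds for i, h3 in enumerate(distinct)}
    return [fold_of_h3[h3] for h3 in h3_sequences]


def train_test_indices(folds, held_out):
    """Return (train, test) antibody indices for one held-out fold."""
    train = [i for i, f in enumerate(folds) if f != held_out]
    test = [i for i, f in enumerate(folds) if f == held_out]
    return train, test


# Toy data (made up): six antibodies, four distinct H3s.
h3s = ["ARDYW", "ARGGY", "ARDYW", "AKSSF", "ARGGY", "ATTLV"]
folds = h3_grouped_folds(h3s, n_folds=2)
train, test = train_test_indices(folds, held_out=0)
print(folds, train, test)
```

With `n_folds=10`, looping `held_out` over 0 to 9 gives the fold structure of a 10-fold cross-validation.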

AMINO-ACID COEFFICIENTS FROM LOGISTIC REGRESSION FOR PREDICTION OF PSR > 0.1

Aromatic and positively-charged amino-acids show propensity for poor PSR

ELECTROSTATIC POTENTIAL MAPPED ONTO MAB SURFACE

[Figure: mAb surfaces in two rows labeled High PSR and Low PSR. mAbs shown: basiliximab, bococizumab, guselkumab; gevokizumab, ibalizumab, ranibizumab]

APBS electrostatics

Large positive patches seen in mAbs showing binding to PSR

CONCLUSIONS

• Training on known crystal structures enables prediction of structural metrics from sequence

• Machine learning methods can successfully predict, from sequence, antibodies exhibiting poor behavior in these assays

• Cross-validation AUCs for the PSR assay are 0.72-0.77

• Amino-acid propensities determined for PSR correlate with observations from other studies

• Predictions from sequence enable:

• Rapid predictions on millions of sequences to help design libraries enriched in the desired biophysical properties

• Improving lead clones, since determined amino-acid coefficients can identify individual positions that contribute to unfavorable developability

THANK YOU

LEARNING STRUCTURAL PROPERTIES FROM SEQUENCE

For each position i along the sequence,

$P_i = f(aa_i,\ aa_{n_1}, \ldots, aa_{n_N},\ \mathrm{VHF},\ \mathrm{VLF},\ \text{CDR lengths})$

where

P_i = structural property of amino acid i, e.g. SASA

Local sequence information:

aa_i = amino-acid type at i

aa_{n_1}, ..., aa_{n_N} = amino-acid types at N neighbors

Global sequence information:

VHF, VLF = heavy and light chain germline family

CDR lengths

Train models using a database of ~1200 antibody structures curated from the PDB
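The inputs to f for one position can be sketched as a small record combining local and global information. This sketch assumes that the N neighbors are the sequence-adjacent residues on either side of position i and that positions beyond the chain ends are padded with '-'. The slide does not say how neighbors are chosen, so both are illustrative assumptions, as are the function name and toy inputs.

```python
PAD = "-"  # Placeholder for neighbors that fall off either end of the chain.


def position_features(seq, i, half_window, vh_family, vl_family, cdr_lengths):
    """Collect the inputs of f for position i.

    Local: residue at i and its sequence neighbors (assumed definition).
    Global: germline families and CDR lengths.
    """
    neighbors = []
    for offset in range(-half_window, half_window + 1):
        if offset == 0:
            continue  # The central residue is stored separately as aa_i.
        j = i + offset
        neighbors.append(seq[j] if 0 <= j < len(seq) else PAD)
    return {
        "aa_i": seq[i],
        "neighbors": neighbors,
        "VHF": vh_family,
        "VLF": vl_family,
        "cdr_lengths": tuple(cdr_lengths),
    }


# Toy input (made up): a short fragment and placeholder global descriptors.
feats = position_features("EVQLVES", 1, 2, "VH3", "VK1", (10, 17, 14, 11, 7, 9))
print(feats["aa_i"], feats["neighbors"])
```

Each record would then be converted to numbers (for example by one-hot encoding the residues and families) and paired with the SASA computed from the corresponding structure as the regression target.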