What can your library do for you?


Rajarshi Guha, Dac-Trung Nguyen, Alexey Zhakarov, Ajit Jadhav

NIH NCATS

ACS Fall Meeting 2016, Philadelphia

August 21, 2016

Library Design

I Historical collections and assay data provide information on how a set of compounds has fared

I Use (dis)similarity and machine learning to construct new collections that show similar behavior

I Plus various constraints

I Libraries can be designed for certain target families or specific screening paradigms

If sufficiently annotated, compound behavior may be correlated to assay and biology characteristics


Two Questions

How likely are compounds associated with a given annotation to be identified as active?

Given a new set of compounds, what sets of assay conditions (as implied by the annotations) will they be active in?

BAO 2.0

Assay Modeling

Prior Work

I BAO annotated datasets

I de Souza et al, 2014; Vempati et al, 2012

I Analyzing HTS datasets using BAO

I Zander-Balderud et al, 2015; Schurer et al, 2011

I Semi-automated annotation of assay descriptions using the BAO

I Clark et al, 2014

http://www.ncbi.nlm.nih.gov/pubmed/24441647





Workflow

I Extract unique BAO terms and, for each term, identify the annotated assays

I Extract active compounds from this set of assays

I Compute the fingerprint bit distribution

I Use these conditional bit distributions to identify the BAO terms that describe the assays in which compounds are likely to be active
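The first three workflow steps can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the input dictionaries and the 4-bit toy fingerprints are all assumptions made for the example (the slides use 881-bit PubChem keys).

```python
from collections import defaultdict

def conditional_bit_frequencies(assay_terms, assay_actives, n_bits):
    """Return {term: per-bit frequency among actives of assays annotated with term}.

    assay_terms:   {assay_id: set of BAO terms}          (hypothetical input layout)
    assay_actives: {assay_id: {compound_id: fingerprint}} (hypothetical input layout)
    """
    # Step 1: invert the assay -> terms annotation into term -> assays.
    term_assays = defaultdict(set)
    for assay, terms in assay_terms.items():
        for term in terms:
            term_assays[term].add(assay)

    freqs = {}
    for term, assays in term_assays.items():
        # Step 2: pool the active compounds from every assay carrying this term.
        # Keying by compound id counts a compound once even if active in several assays.
        actives = {}
        for assay in assays:
            actives.update(assay_actives.get(assay, {}))
        if not actives:
            continue
        # Step 3: fraction of pooled actives that have each bit set.
        counts = [0] * n_bits
        for fp in actives.values():
            for i, bit in enumerate(fp):
                counts[i] += bit
        freqs[term] = [c / len(actives) for c in counts]
    return freqs

# Toy data: 4-bit fingerprints stand in for the 881-bit keys.
assay_terms = {"aid1": {"bioluminescence"}, "aid2": {"bioluminescence", "thermal shift"}}
assay_actives = {
    "aid1": {"c1": (1, 0, 1, 0), "c2": (1, 1, 0, 0)},
    "aid2": {"c2": (1, 1, 0, 0), "c3": (0, 1, 0, 1)},
}
freqs = conditional_bit_frequencies(assay_terms, assay_actives, n_bits=4)
```

Whether a compound that is active in several assays should be counted once or once per assay is a design choice the slides do not specify; the sketch counts it once.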

Dataset Overview

I Extracted 4010 PubChem AIDs from BARD

I Primary, confirmation, counterscreening assays

I 154M outcomes

I 740K compounds

I PubChem 881-bit keys using CDK and NCGC implementations

I 192 unique BAO terms

http://cdk.github.io/cdk/1.5/docs/api/org/openscience/cdk/fingerprint/PubchemFingerprinter.html

https://bitbucket.org/caodac/pcfp

Dataset Overview

[Figure: number of outcomes (log scale) by outcome class: Active, Inactive, Inconclusive, Probe, Unspecified]

[Figure: density of assays versus log number of compounds, shown separately for Active and Inactive outcomes]

[Figure: number of terms versus term depth (2 to 13)]

[Figure: number of assays versus number of BAO terms per assay (1 to 13)]

Class Imbalances

Imbalanced classes are problematic, and some of the terms with near-balanced classes are not very specific (e.g., imaging method)

[Figure: class ratio (0 to 9) versus BAO ID; labeled terms: VICTOR X2 Multilabel Plate Reader, imaging method, radiometry method, SpectraMax 190 Microplate Reader, reporter protein, thermal shift]

Problem formulation

For a given library of compounds X, we would like to calculate a ranked list T of relevant BAO terms that are most likely associated with X. Let x_j ∈ X be a compound and t_i a BAO term. The list T is ordered according to

\[
\arg\max_i \sum_j p(t_i \mid x_j), \qquad (1)
\]

where p(t_i | x_j) is the probability that BAO term t_i is associated with compound x_j. From Bayes' rule, we have

\[
p(t_i \mid x_j) = \frac{p(x_j \mid t_i)\, p(t_i)}{p(x_j)} \quad \text{or} \quad p(t_i \mid x_j) \propto p(x_j \mid t_i)\, p(t_i).
\]

Given that BAO terms are annotated at the assay level, we instead have

\[
p(t_i \mid x_j) \propto p(t_i) \sum_k p(x_j \mid a_k)\, p(a_k \mid t_i), \qquad (2)
\]

where a_k is a BAO annotated assay.
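One way to read the step to Eq. (2) is as a marginalization over the annotated assays. This assumes that, once the assay is known, the compound's fingerprint carries no further dependence on the term; that assumption is not stated on the slide.

\[
p(x_j \mid t_i) = \sum_k p(x_j \mid a_k, t_i)\, p(a_k \mid t_i) \approx \sum_k p(x_j \mid a_k)\, p(a_k \mid t_i)
\]

Substituting this into p(t_i | x_j) ∝ p(x_j | t_i) p(t_i) gives Eq. (2).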

A Bayesian Approach for Ranking

Note that p(x_j | a_k) is the sampling function specified over only the active compounds in assay a_k. In our model, the bits of x_j are treated as independent Bernoulli variables with parameters θ, i.e.,

\[
p(x_j \mid a_k) = \prod_i \theta_i^{x_{ji}} (1 - \theta_i)^{1 - x_{ji}},
\]

where x_{ji} ∈ {0, 1} is the i-th bit of the PubChem substructural fingerprint (here i indexes fingerprint bits, not BAO terms).

Learning BAO terms for a library of compounds amounts to estimating θ, p(t_i), and p(a_k | t_i).
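A minimal sketch of scoring one compound against Eq. (2), working in log space to avoid underflow over hundreds of bits. The Laplace smoothing of θ, the uniform choice for p(a_k | t_i), and all function and variable names are assumptions for illustration; the slides do not say how these quantities were estimated.

```python
import math

def fit_theta(active_fps, n_bits, alpha=1.0):
    """Per-bit Bernoulli parameters from one assay's actives.
    Laplace smoothing (alpha) is an assumption; it keeps theta strictly inside (0, 1)."""
    n = len(active_fps)
    return [(sum(fp[i] for fp in active_fps) + alpha) / (n + 2 * alpha)
            for i in range(n_bits)]

def log_p_x_given_assay(x, theta):
    """log of prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i)."""
    return sum(math.log(th) if bit else math.log(1.0 - th)
               for bit, th in zip(x, theta))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def term_scores(x, thetas, term_assays, term_prior):
    """Unnormalized log p(t | x) following Eq. (2).
    p(a_k | t_i) is taken as uniform over the assays annotated with t_i (assumption)."""
    scores = {}
    for term, assays in term_assays.items():
        log_terms = [log_p_x_given_assay(x, thetas[a]) - math.log(len(assays))
                     for a in assays]
        scores[term] = math.log(term_prior[term]) + logsumexp(log_terms)
    return scores

# Toy example with 3-bit fingerprints and hypothetical term names.
thetas = {
    "aid1": fit_theta([(1, 1, 0), (1, 0, 0)], n_bits=3),
    "aid2": fit_theta([(0, 0, 1), (0, 1, 1)], n_bits=3),
}
term_assays = {"termA": ["aid1"], "termB": ["aid2"]}
term_prior = {"termA": 0.5, "termB": 0.5}
scores = term_scores((1, 1, 0), thetas, term_assays, term_prior)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The query fingerprint resembles the actives of `aid1`, so `termA` ranks first.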

Per-Term Activity Classifier

I For a given ontology term T_i, predict whether a compound will be active or not

I Model this using Naive Bayes, where we extract the sets of actives and inactives from assays annotated with T_i

I Results in a set of models {M_1, M_2, ..., M_N}

I For a new compound in the library, obtain the probability of being active for term T_i for all i and take the top k terms

I Aggregate the top k terms from all compounds in the library

I Represents the set of ontology terms defining an assay in which these compounds would likely be active
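The per-term models can be sketched with a small Bernoulli Naive Bayes written against the standard library. The class, the smoothing constant and the toy training data are hypothetical; the slides do not say which Naive Bayes implementation or settings were used.

```python
import math

class BernoulliNB:
    """Minimal Bernoulli Naive Bayes over binary fingerprints for one ontology term.
    Requires at least one active (1) and one inactive (0) training example."""

    def fit(self, fps, labels, alpha=1.0):
        n_bits = len(fps[0])
        self.log_prior, self.theta = {}, {}
        for cls in (0, 1):
            rows = [fp for fp, y in zip(fps, labels) if y == cls]
            self.log_prior[cls] = math.log(len(rows) / len(fps))
            # Laplace-smoothed probability that each bit is set within this class.
            self.theta[cls] = [
                (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for i in range(n_bits)
            ]
        return self

    def prob_active(self, fp):
        log_joint = {}
        for cls in (0, 1):
            ll = self.log_prior[cls]
            for bit, th in zip(fp, self.theta[cls]):
                ll += math.log(th if bit else 1.0 - th)
            log_joint[cls] = ll
        # Normalize the two joint log-probabilities into P(active | fp).
        m = max(log_joint.values())
        z = sum(math.exp(v - m) for v in log_joint.values())
        return math.exp(log_joint[1] - m) / z

def top_k_terms(fp, models, k):
    """Terms with the k highest predicted probabilities of activity for one compound."""
    probs = {term: model.prob_active(fp) for term, model in models.items()}
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Toy training sets: (fingerprints, active labels) per hypothetical term.
fps = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]
training = {"termA": (fps, [1, 1, 0, 0]), "termB": (fps, [0, 0, 1, 1])}
models = {term: BernoulliNB().fit(x, y) for term, (x, y) in training.items()}
top = top_k_terms((1, 1, 0), models, k=1)
```

With the class imbalance noted earlier in the deck, the class prior can dominate such a model, so the raw probabilities from a sketch like this should be read with care.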

Test Libraries

I Considered several libraries to test out the approach

I MIPE (1912 compounds) - Approved, investigational drugs, constructed for functional diversity

I LOPAC (1280 compounds) - Diverse library, designed for enrichment of bioactivity

I Natural Products (5000 compounds)

I 1000 member subset of ChEMBL GPCR collection

I 1000 member subset of ChEMBL Kinase collection

Test Libraries

In the PubChem fingerprint space, the libraries are not very different

[Figure: normalized frequency (0 to 1) versus bit position, one panel per library: gpcr1, gpcr2, kinase1, kinase2, lopac, mipe, np]

Test Libraries - Distance Matrix

[Figure: heatmap of pairwise Euclidean distance (0 to 4) between the libraries lopac, mipe, np, kinase1, kinase2, gpcr1, gpcr2]
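A library comparison of this kind could be computed as sketched below, assuming each library is summarized by its normalized bit-frequency vector and that libraries are compared by Euclidean distance between those vectors. The slides do not state exactly what the distance was computed on, so this pairing is an assumption; the function names and 3-bit toy fingerprints are illustrative only.

```python
import math

def bit_profile(fps):
    """Normalized frequency of each bit across the fingerprints of one library."""
    n = len(fps)
    return [sum(column) / n for column in zip(*fps)]

def distance_matrix(libraries):
    """Pairwise Euclidean distances between library bit profiles.
    Returns (names, matrix) with rows and columns in sorted-name order."""
    names = sorted(libraries)
    profiles = {name: bit_profile(libraries[name]) for name in names}
    matrix = [[math.dist(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix

# Toy libraries with 3-bit fingerprints.
libraries = {
    "lopac": [(1, 0, 1), (1, 1, 0)],
    "mipe": [(1, 0, 0), (1, 1, 0)],
    "np": [(0, 1, 1), (0, 0, 1)],
}
names, dist = distance_matrix(libraries)
```

`math.dist` requires Python 3.8 or later.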

Prediction Workflow

Bayesian Ranking

I Compute the likelihood of all terms for each compound

I Aggregate across the library (mean likelihood) and take the top k

Activity Models

I For molecules predicted active, collect the corresponding terms

I Retain the top k most frequent terms across the library

We take the top k terms as the set of annotations describing an assay in which the library will show activity
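The two aggregation strategies above can be sketched side by side. The function names and the toy scores are hypothetical, and the first function assumes every compound has a score for every term.

```python
from collections import Counter, defaultdict

def top_k_by_mean_score(per_compound_scores, k):
    """Bayesian ranking arm: average each term's score over the library, keep the k best.
    Assumes every compound carries a score for every term."""
    totals = defaultdict(float)
    for scores in per_compound_scores:
        for term, score in scores.items():
            totals[term] += score
    n = len(per_compound_scores)
    means = {term: total / n for term, total in totals.items()}
    return sorted(means, key=means.get, reverse=True)[:k]

def top_k_by_frequency(per_compound_active_terms, k):
    """Activity model arm: count how often each term is predicted active, keep the k most frequent."""
    counts = Counter(term for terms in per_compound_active_terms for term in terms)
    return [term for term, _ in counts.most_common(k)]

# Toy inputs for a two-compound and a four-compound library.
log_scores = [
    {"termA": -2.0, "termB": -5.0, "termC": -3.0},
    {"termA": -4.0, "termB": -2.0, "termC": -6.0},
]
predicted_active = [{"termA", "termB"}, {"termA"}, {"termB", "termC"}, {"termA"}]

bayes_terms = top_k_by_mean_score(log_scores, k=2)
model_terms = top_k_by_frequency(predicted_active, k=2)
```

Averaging log-likelihoods, as here, is a geometric rather than arithmetic mean of the likelihoods; which one the slides intend by "mean likelihood" is not stated.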

Result - Bayesian Ranking

[Figure: heatmap of average probability (color scale labeled "Avg Prob", −5e+09 to −1e+09) for BAO terms (rows, BAO_0000001 through BAO_0003069) across the lopac, mipe and NP libraries (columns)]

Result - Per Term Activity Classifier

[Figure: rank (1 to 5) of the terms bioluminescence, molecular redistribution determination method, direct enzyme activity measurement method, whole cell lysate format and BAO_0000722 for each library: lopac, mipe, np, kinase1, kinase2, gpcr1, gpcr2]

What’s Different Between Libraries?


Term Depth for the ’Differential’ Terms

[Figure: two scatter plots of term depth (3 to 8) for the differential terms: MIPE versus LOPAC, and MIPE versus NP]
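Term depth can be read as the distance of a term from the ontology root. A minimal sketch follows, assuming a single-parent hierarchy stored as a child-to-parent map and a root depth of 0; the slides do not say how depth was counted or how multiple parents were handled, and the term names below are placeholders rather than real BAO terms.

```python
def term_depth(term, parent):
    """Number of is-a edges from `term` up to the root of a single-parent hierarchy.
    `parent` maps each non-root term to its parent (hypothetical representation)."""
    depth = 0
    seen = {term}
    while term in parent:
        term = parent[term]
        if term in seen:
            raise ValueError("cycle detected in hierarchy")
        seen.add(term)
        depth += 1
    return depth

# Placeholder hierarchy: root -> child -> grandchild.
parent = {"child": "root", "grandchild": "child"}
depths = {name: term_depth(name, parent) for name in ("root", "child", "grandchild")}
```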

Pitfalls

If sufficiently annotated, compound behavior may be correlated to assay and biology characteristics

I A very abstract, possibly lossy, view of the effect of compounds on biology

I Depends on correct and meaningful annotations

I Annotation terms are context dependent, but this may not be considered when annotating a dataset

I BAO terms exhibit hierarchical relationships and ignoring them is simplistic

Acknowledgements

I Qiong Cheng (U. Miami)

I Stephan Schurer (U. Miami)

Source code and slides

https://spotlite.nih.gov/ncats/acs-fall-2016/tree/master
