What can your library do for you?
Rajarshi Guha, Dac-Trung Nguyen, Alexey Zhakarov, Ajit Jadhav
NIH NCATS
ACS Fall Meeting 2016, Philadelphia
August 21, 2016
Library Design
- Historical collections and assay data provide information on how a set of compounds has fared
- Use (dis)similarity and machine learning to construct new collections that show similar behavior
  - Plus various constraints
- Libraries can be designed for certain target families or specific screening paradigms
If sufficiently annotated, compound behavior may be correlated with assay and biology characteristics
Two Questions
How likely are compounds associated with a given annotation to be identified as active?
Given a new set of compounds, in what sets of assay conditions (as implied by the annotations) will they be active?
BAO 2.0
Assay Modeling
Prior Work
- BAO annotated datasets
  - de Souza et al, 2014; Vempati et al, 2012
- Analyzing HTS datasets using BAO
  - Zander-Balderud et al, 2015; Schurer et al, 2011
- Semi-automated annotation of assay descriptions using the BAO
  - Clark et al, 2014
Workflow
- Extract unique BAO terms and, for each term, identify the annotated assays
- Extract active compounds from this set of assays
- Compute the fingerprint bit distribution
- Use these conditional bit distributions to identify the BAO terms that describe the assays in which compounds are likely to be active
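The first three workflow steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout (assay IDs mapped to BAO terms and to active-compound fingerprints stored as lists of 0/1 bits), the function name `bit_distribution_per_term`, and the Laplace smoothing constant `alpha` are all assumptions made for the example.

```python
from collections import defaultdict

def bit_distribution_per_term(assay_terms, assay_actives, n_bits, alpha=1.0):
    """Estimate, per BAO term, how often each fingerprint bit is set among actives.

    assay_terms:   dict assay_id -> iterable of BAO term IDs (hypothetical layout)
    assay_actives: dict assay_id -> list of fingerprints (lists of 0/1, length n_bits)
    alpha:         Laplace smoothing constant (an assumption, not from the slides)
    """
    counts = defaultdict(lambda: [0] * n_bits)  # per-term count of set bits
    totals = defaultdict(int)                   # per-term number of active fingerprints
    for assay_id, terms in assay_terms.items():
        for fp in assay_actives.get(assay_id, []):
            for term in terms:
                totals[term] += 1
                for i, bit in enumerate(fp):
                    counts[term][i] += bit
    # Terms whose assays have no actives are absent from the result.
    return {
        term: [(c + alpha) / (totals[term] + 2 * alpha) for c in counts[term]]
        for term in totals
    }

# Toy example with 4-bit fingerprints; assay IDs and term assignments are illustrative only.
assay_terms = {"AID1": ["BAO_0000722"], "AID2": ["BAO_0000722", "BAO_0000015"]}
assay_actives = {"AID1": [[1, 0, 1, 0]], "AID2": [[1, 1, 0, 0]]}
theta = bit_distribution_per_term(assay_terms, assay_actives, n_bits=4)
```

Note that in this sketch a compound active in several assays sharing a term is counted once per assay; whether to deduplicate compounds first is a design choice the slides do not specify.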
Dataset Overview
- Extracted 4010 PubChem AIDs from BARD
- Primary, confirmation, counterscreening assays
- 154M outcomes
- 740K compounds
- PubChem 881-bit keys using CDK and NCGC implementations
- 192 unique BAO terms
Dataset Overview
[Figure, four panels: (1) number of outcomes (log scale) by outcome type: Active, Inactive, Inconclusive, Probe, Unspecified; (2) density of assays versus log number of compounds, by outcome (Active, Inactive); (3) number of terms versus term depth; (4) number of assays versus number of BAO terms.]
Class Imbalances
Imbalanced classes are problematic, and some of the terms with near-balanced classes are not very specific (e.g., imaging method)
[Figure: class ratio versus BAO ID. Labeled terms: VICTOR X2 Multilabel Plate Reader, imaging method, radiometry method, SpectraMax 190 Microplate Reader, reporter protein, thermal shift.]
Problem formulation
For a given library of compounds $X$, we would like to calculate a ranked list $T$ of relevant BAO terms that are most likely associated with $X$. Let $x_j \in X$ be a compound and $t_i$ a BAO term. The list $T$ is ordered according to

\[ \arg\max_i \sum_j p(t_i \mid x_j), \tag{1} \]

where $p(t_i \mid x_j)$ is the probability that BAO term $t_i$ is associated with compound $x_j$. From Bayes' rule, we have

\[ p(t_i \mid x_j) = \frac{p(x_j \mid t_i)\, p(t_i)}{p(x_j)} \quad \text{or} \quad p(t_i \mid x_j) \propto p(x_j \mid t_i)\, p(t_i). \]

Given that BAO terms are annotated at the assay level, we instead have

\[ p(t_i \mid x_j) \propto p(t_i) \sum_k p(x_j \mid a_k)\, p(a_k \mid t_i), \tag{2} \]

where $a_k$ is a BAO-annotated assay.
A Bayesian Approach for Ranking
Note that $p(x_j \mid a_k)$ is the sampling function specified over only the active compounds in assay $a_k$. In our model, the bits of $x_j$ are modeled as independent Bernoulli variables with parameters $\theta$, i.e.,

\[ p(x_j \mid a_k) = \prod_i \theta_i^{x_{ji}} (1 - \theta_i)^{1 - x_{ji}}, \]

where $x_{ji} \in \{0, 1\}$ is the $i$-th bit of the PubChem substructural fingerprint.
Learning BAO terms for a library of compounds amounts to estimating $\theta$, $p(t_i)$, and $p(a_k \mid t_i)$.
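To make the ranking concrete, here is a minimal sketch of Eq. (1) and Eq. (2) with the Bernoulli likelihood, computed in log space because a product over 881 bits would underflow. It is an illustration under stated assumptions, not the authors' code: the function names, the dictionary-based data layout, and the choice to normalize each compound's scores over terms before summing are all assumptions, and the parameters are taken as already estimated.

```python
import math

def log_bernoulli(fp, theta):
    """log p(x | a) = sum_i [x_i log(theta_i) + (1 - x_i) log(1 - theta_i)]."""
    return sum(math.log(t) if bit else math.log(1.0 - t) for bit, t in zip(fp, theta))

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def rank_terms(library, assay_theta, p_assay_given_term, p_term):
    """Rank BAO terms for a library via Eq. (1), using Eq. (2) per compound.

    assay_theta:        dict assay -> per-bit Bernoulli parameters (0 < theta < 1)
    p_assay_given_term: dict term -> dict assay -> p(a_k | t_i), at least one > 0
    p_term:             dict term -> prior p(t_i)
    """
    terms = sorted(p_term)
    scores = {t: 0.0 for t in terms}
    for fp in library:
        log_lik = {a: log_bernoulli(fp, th) for a, th in assay_theta.items()}
        log_post = []
        for t in terms:
            # Eq. (2) in log space: log p(t_i) + log sum_k p(x_j | a_k) p(a_k | t_i)
            mix = [log_lik[a] + math.log(w)
                   for a, w in p_assay_given_term[t].items() if w > 0]
            log_post.append(math.log(p_term[t]) + logsumexp(mix))
        # Assumption: normalize over terms so the sum over compounds is well defined.
        norm = logsumexp(log_post)
        for t, lp in zip(terms, log_post):
            scores[t] += math.exp(lp - norm)
    return sorted(terms, key=lambda t: scores[t], reverse=True), scores

# Toy example: two assays, two terms, 2-bit fingerprints (all values hypothetical).
assay_theta = {"a1": [0.9, 0.1], "a2": [0.1, 0.9]}
p_assay_given_term = {"t1": {"a1": 1.0}, "t2": {"a2": 1.0}}
p_term = {"t1": 0.5, "t2": 0.5}
library = [[1, 0], [1, 0], [0, 1]]
ranking, scores = rank_terms(library, assay_theta, p_assay_given_term, p_term)
```

In the toy library two of the three compounds match the bit pattern of assay `a1`, so `t1` ranks first.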
Per-Term Activity Classifier
- For a given ontology term $T_i$, predict whether a compound will be active or not
- Model this using Naïve Bayes, where we extract the sets of actives and inactives from assays annotated with $T_i$
- Results in a set of models $\{M_1, M_2, \dots, M_N\}$
- For a new compound in a library, obtain the probability of being active for term $T_i$ for all $i$ and take the top k terms
- Aggregate the top k terms from all compounds in the library
  - Represents the set of ontology terms defining an assay in which these compounds would likely be active
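The per-term classifier steps above can be sketched with a small two-class Bernoulli naive Bayes model written in plain Python. This is a hedged illustration only: the function names, the dictionary used to hold a model, the Laplace smoothing, and the toy fingerprints are assumptions, and the slides do not say which naive Bayes implementation was used.

```python
import math
from collections import Counter

def fit_bernoulli_nb(actives, inactives, alpha=1.0):
    """Fit a two-class Bernoulli naive Bayes model for one ontology term.

    Both classes must be non-empty; alpha is a Laplace smoothing assumption.
    """
    def bit_probs(fps):
        n = len(fps)
        return [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*fps)]
    total = len(actives) + len(inactives)
    return {
        "prior_active": len(actives) / total,
        "theta_active": bit_probs(actives),
        "theta_inactive": bit_probs(inactives),
    }

def prob_active(model, fp):
    """Posterior probability that fingerprint fp is active under the model."""
    def log_lik(theta):
        return sum(math.log(t) if b else math.log(1.0 - t) for b, t in zip(fp, theta))
    la = math.log(model["prior_active"]) + log_lik(model["theta_active"])
    li = math.log(1.0 - model["prior_active"]) + log_lik(model["theta_inactive"])
    m = max(la, li)  # subtract the max before exponentiating, for stability
    return math.exp(la - m) / (math.exp(la - m) + math.exp(li - m))

def top_terms_for_library(models, library, k):
    """Take each compound's top-k terms by P(active), then count terms across the library."""
    tally = Counter()
    for fp in library:
        ranked = sorted(models, key=lambda t: prob_active(models[t], fp), reverse=True)
        tally.update(ranked[:k])
    return tally.most_common()

# Toy example with 3-bit fingerprints and two hypothetical terms.
models = {
    "term_A": fit_bernoulli_nb(actives=[[1, 1, 0], [1, 0, 0]],
                               inactives=[[0, 0, 1], [0, 1, 1]]),
    "term_B": fit_bernoulli_nb(actives=[[0, 0, 1], [0, 1, 1]],
                               inactives=[[1, 1, 0], [1, 0, 0]]),
}
library = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
result = top_terms_for_library(models, library, k=1)
```

With heavily imbalanced classes the prior dominates the posterior, which is one reason the class imbalance noted earlier matters for these models.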
Test Libraries
- Considered several libraries to test out the approach
  - MIPE (1912 compounds) - Approved, investigational drugs, constructed for functional diversity
  - LOPAC (1280 compounds) - Diverse library, designed for enrichment of bioactivity
  - Natural Products (5000 compounds)
  - 1000 member subset of ChEMBL GPCR collection
  - 1000 member subset of ChEMBL Kinase collection
Test Libraries
In the PubChem fingerprint space, the libraries are not very different
[Figure: normalized bit frequency (0 to 1) versus bit position, one panel per library: gpcr1, gpcr2, kinase1, kinase2, lopac, mipe, np.]
Test Libraries - Distance Matrix
[Figure: heatmap of pairwise Euclidean distances (color scale 0 to 4) between the libraries lopac, mipe, np, kinase1, kinase2, gpcr1, gpcr2.]
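One way to compare libraries in fingerprint space is sketched below. It assumes the distance matrix is computed between per-library normalized bit-frequency profiles; the slides do not state this explicitly, and the function names and toy data are hypothetical.

```python
import math

def bit_frequencies(library):
    """Fraction of compounds in the library that set each fingerprint bit."""
    n = len(library)
    return [sum(col) / n for col in zip(*library)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(libraries):
    """Pairwise Euclidean distances between library bit-frequency profiles."""
    profiles = {name: bit_frequencies(fps) for name, fps in libraries.items()}
    names = sorted(profiles)
    return {(a, b): euclidean(profiles[a], profiles[b]) for a in names for b in names}

# Toy example with 2-bit fingerprints (library names and contents are illustrative).
libraries = {"lib_a": [[1, 0], [1, 1]], "lib_b": [[0, 1], [0, 1]]}
d = distance_matrix(libraries)
```

Because the profiles are averages over compounds, two libraries with very different members can still be close under this measure if their overall substructure frequencies are similar.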
Prediction Workflow
Bayesian Ranking
- Compute likelihood of all terms for each compound
- Aggregate across library (mean likelihood) and take top k
Activity Models
- For molecules predicted active, collect corresponding terms
- Retain the top k most frequent terms across the library
We take the top k terms as the set of annotations describing an assay in which the library will show activity
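The two aggregation strategies can be sketched side by side, given per-compound scores that have already been computed. The function names, the input layout (one dictionary of term scores per compound), and the 0.5 threshold used to call a compound "predicted active" are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

def top_k_by_mean(scores, k):
    """Bayesian ranking: average each term's (log-)likelihood over the library.

    scores: list with one dict per compound, mapping term -> (log-)likelihood.
    """
    terms = scores[0].keys()
    mean = {t: sum(s[t] for s in scores) / len(scores) for t in terms}
    return sorted(mean, key=mean.get, reverse=True)[:k]

def top_k_by_frequency(probs, k, threshold=0.5):
    """Activity models: count terms for which a compound is predicted active.

    probs: list with one dict per compound, mapping term -> P(active).
    threshold: cutoff for "predicted active" (an assumption).
    """
    tally = Counter(t for p in probs for t, v in p.items() if v >= threshold)
    return [t for t, _ in tally.most_common(k)]

# Toy inputs for a two- and three-compound library (values are hypothetical).
scores = [
    {"t1": -10.0, "t2": -12.0, "t3": -30.0},
    {"t1": -14.0, "t2": -11.0, "t3": -25.0},
]
probs = [
    {"t1": 0.9, "t2": 0.2, "t3": 0.6},
    {"t1": 0.7, "t2": 0.4, "t3": 0.1},
    {"t1": 0.6, "t2": 0.8, "t3": 0.55},
]
mean_top = top_k_by_mean(scores, k=2)
freq_top = top_k_by_frequency(probs, k=2)
```

The mean-based variant lets a few strongly scoring compounds dominate, while the frequency-based variant weights every predicted-active compound equally.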
Result - Bayesian Ranking
[Figure: heatmap of average probability ("Avg Prob", color scale from −5e+09 to −1e+09) for each BAO term (rows, BAO_0000001 through BAO_0003069) across the lopac, mipe, and NP libraries (columns).]
Result - Per Term Activity Classifier
[Figure: rank (1 to 5) of the top terms for each library (lopac, mipe, np, kinase1, kinase2, gpcr1, gpcr2). Terms shown: bioluminescence; molecular redistribution determination method; direct enzyme activity measurement method; whole cell lysate format; BAO_0000722.]
What’s Different Between Libraries?
Term Depth for the 'Differential' Terms
[Figure, two scatter panels: term depth (MIPE) versus term depth (LOPAC), and term depth (MIPE) versus term depth (NP); both axes range from 3 to 8.]
Pitfalls
If sufficiently annotated, compound behavior may be correlated with assay and biology characteristics
- A very abstract, possibly lossy, view of the effect of compounds on biology
- Depends on correct and meaningful annotations
- Annotation terms are context dependent, but this may not be considered when annotating a dataset
- BAO terms exhibit hierarchical relationships and ignoring them is simplistic
Acknowledgements
- Qiong Cheng (U. Miami)
- Stephan Schurer (U. Miami)
Source code and slides