
Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 9: Graph Mining in Chemoinformatics

Chloé-Agathe Azencott & Karsten Borgwardt

February 18 to March 1, 2013

Machine Learning & Computational Biology Research Group
Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen

Drug discovery


Modern therapeutic research
From serendipity to rationalized drug design

Ancient Greeks treat infections with mould

[Figure: chemical structure diagram]

Biapenem in PBP-1A

Drug discovery process


1. Find a target
   Protein that we want to inhibit so as to interfere with a biological process

2. Identify hits
   Compounds likely to bind to the target

3. Hit-to-lead: characterize hits
   Can they be drugs? (ADME-Tox)

4. Lead optimization and synthesis
   - bioactivity
   - pharmacokinetics
   - synthetic pathway

5. Assay
   - in vitro
   - in vivo
   - clinical



Duration: 52 months, 90 months

Cost: $500,000,000 to $2,000,000,000

Chemoinformatics


How can computer science help?
→ Chemoinformatics!

“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown

“... the application of informatics methods to solve chemical problems.”– J. Gasteiger and T. Engel



The chemical space

$10^{60}$ possible small organic molecules

$10^{22}$ stars in the observable universe

(Slide courtesy of Matthew A. Kayala)



QSAR / QSPR

QSAR: Qualitative Structure-Activity Relationship, i.e. classification

QSPR: Quantitative Structure-Property Relationship, i.e. regression

Representing chemicals in silico


Expert knowledge molecular descriptors
→ hard, potentially incomplete

Molecules are...

[Figure: chemical structure diagram]


Similar Property Principle
Molecules having similar structures should exhibit similar activities.

→ Structure-based representations
Compare molecules by comparing substructures

Molecular graph


[Figure: molecular graph; vertices labeled with atom types (C, N, O, S), edges labeled with bond types (e.g. d)]

Undirected labeled graph
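As a minimal sketch of what "undirected labeled graph" can mean in code, the class below stores atom labels on vertices and bond labels on edges. The class name `MolecularGraph`, the bond labels `"s"`/`"d"` (single/double) and the toy molecule are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict


class MolecularGraph:
    """Undirected graph with labeled vertices (atoms) and labeled edges (bonds)."""

    def __init__(self):
        self.atom_label = {}                # vertex id -> element symbol
        self.neighbors = defaultdict(dict)  # vertex id -> {neighbor id: bond label}

    def add_atom(self, idx, element):
        self.atom_label[idx] = element

    def add_bond(self, i, j, bond):
        # Store both directions so that the graph is undirected.
        self.neighbors[i][j] = bond
        self.neighbors[j][i] = bond


# Hypothetical toy molecule: heavy atoms of acetamide, CH3-C(=O)-NH2.
mol = MolecularGraph()
for idx, element in enumerate(["C", "C", "O", "N"]):
    mol.add_atom(idx, element)
mol.add_bond(0, 1, "s")  # single bond
mol.add_bond(1, 2, "d")  # double bond
mol.add_bond(1, 3, "s")  # single bond
```

Hydrogens are left implicit here, which is a common simplification; whether to include them is a modeling choice.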

Fingerprints


Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph

$$\phi(A) = (\phi_s(A))_{s \text{ substructure}}$$

where

$$\phi_s(A) = \begin{cases} 1 & \text{if } s \text{ occurs in } A \\ 0 & \text{otherwise} \end{cases}$$

Extension of traditional chemical fingerprints

Learning from fingerprints
Classical machine learning and data mining techniques can be applied to these vectorial feature representations.

- Any distance / kernel can be used
- Classification
- Feature selection
- Clustering

Fingerprints compression

Systematic enumeration → long, sparse vectors
e.g. 50,000 random compounds from ChemDB
→ 300,000 paths of length up to 8
→ 300 non-zeros on average

“Naive” compression
List the positions of the 1s
$2^{19} = 524{,}288$
average encoding: $300 \times 19 = 5{,}700$ bits

Modulo compression (lossy)

Elias-Gamma Monotone Encoding (lossless) [Baldi et al., 2007]
index $j$ → $\lfloor \log(j) \rfloor$ 0 bits + binary encoding of $j$
$j_i < j_{i+1}$: $\lfloor \log(j_{i+1}) \rfloor \to \lfloor \log(j_i) - \log(j_{i+1}) \rfloor$
average compressed size = 1,800 bits
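The two size estimates above can be checked with a short sketch. It implements the plain Elias-gamma code of a single index (not the monotone variant of [Baldi et al., 2007]) and the fixed-width "naive" encoding; the function names are illustrative assumptions.

```python
def elias_gamma(j):
    """Elias-gamma code of a positive integer j, returned as a bit string."""
    if j < 1:
        raise ValueError("Elias-gamma is defined for integers >= 1")
    binary = bin(j)[2:]  # binary encoding of j: floor(log2 j) + 1 bits
    # floor(log2 j) zero bits, followed by the binary encoding of j
    return "0" * (len(binary) - 1) + binary


def naive_bits(num_ones, vector_length):
    """Bits needed to list the positions of the 1s with fixed-width indices."""
    width = (vector_length - 1).bit_length()  # bits per index
    return num_ones * width


codes = [elias_gamma(j) for j in (1, 5, 8)]
naive_size = naive_bits(300, 2 ** 19)
```

With 300 non-zero positions in a vector of length $2^{19}$, `naive_bits` reproduces the 5,700 bits quoted above. The leading zeros of an Elias-gamma code tell the decoder how many bits of binary encoding follow, which is what makes the code self-delimiting.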

Frequent patterns fingerprints


MOLFEA [Helma et al., 2004]

P = positive (mutagenic) compounds
N = negative compounds

features: fragments (= patterns) f such that both freq(f,P) ≥ t and freq(f,N) ≥ t

Limited to frequent linear patterns

ML algorithm: SVM with linear or quadratic kernel

Frequent patterns fingerprints


MOLFEA [Helma et al., 2004]

CPDB – Carcinogenic Potency DataBase
684 compounds classified in 341 mutagens and 343 non-mutagens according to the Ames test on Salmonella

[Figure: Mutagenicity prediction [Helma et al., 2004]. Cross-validated sensitivity (y-axis, 50 to 100) against frequency threshold (x-axis: 1%, 3%, 5%, 10%), for the linear kernel and the quadratic kernel]

Spectrum kernels


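$$\phi(A) = (\phi_s(A))_{s \in S}$$

$$K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$$

$k \in \mathbb{R}^{\mathbb{R}^{|S|} \times \mathbb{R}^{|S|}}$ can be
- Dot product (linear kernel)
- RBF kernel
- Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
- MinMax kernel: $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$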
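A minimal sketch of the Tanimoto and MinMax similarities, assuming binary fingerprints are stored as sets of pattern strings and count fingerprints as dictionaries from pattern to count. The storage format and the example patterns are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints given as sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def minmax(a, b):
    """MinMax similarity between two count fingerprints (dict: pattern -> count)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


t = tanimoto({"CsC", "CdO"}, {"CsC", "CsN"})
m = minmax({"CsC": 2, "CdO": 1}, {"CsC": 1, "CsN": 1})
```

On fingerprints whose counts are all 0 or 1, MinMax coincides with Tanimoto, so MinMax can be seen as the count-aware generalization.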

Spectrum kernels


Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.

Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.

Proof for MinMax:

$$\text{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$$

with $\phi(x)$ of length: # patterns × max count
$\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$
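The construction used in the MinMax proof can be checked numerically. The sketch below expands count vectors into binary vectors of length (# patterns × q) and evaluates the Tanimoto-style ratio of dot products; it assumes q is at least the maximal count, and the pattern names are made up for the example.

```python
def unary_expand(counts, patterns, q):
    """Binary vector of length len(patterns) * q: entry i is 1 iff the pattern
    indexed by i // q appears more than i % q times."""
    return [1 if counts.get(patterns[i // q], 0) > i % q else 0
            for i in range(len(patterns) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minmax_via_expansion(x, y, patterns, q):
    """Tanimoto ratio of dot products on the expanded binary vectors."""
    px, py = unary_expand(x, patterns, q), unary_expand(y, patterns, q)
    xy = dot(px, py)
    return xy / (dot(px, px) + dot(py, py) - xy)


patterns = ["CsC", "CdO", "CsN"]
x = {"CsC": 2, "CdO": 1}
y = {"CsC": 1, "CsN": 1}
value = minmax_via_expansion(x, y, patterns, q=2)
```

Here the dot product of the expanded vectors equals the sum of the minima of the counts, and the denominator equals the sum of the maxima, so `value` matches a direct MinMax computation on `x` and `y` (sum of minima 1, sum of maxima 4).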

All patterns fingerprints


Paths fingerprints
Labeled sub-paths (walks)

[Figure: molecular graph with some sub-paths of length 3 highlighted, e.g. NsCsCsS and CsCsCdO]

Some sub-paths of length 3
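Labeled paths such as NsCsCsS can be enumerated by a depth-first search from every atom. The sketch below is one possible implementation under stated assumptions: paths are simple (no atom is visited twice), length is counted in bonds, and the toy molecule and function name are hypothetical.

```python
def labeled_paths(atoms, bonds, max_bonds):
    """Label strings of all simple paths with at most max_bonds bonds.

    atoms: dict vertex -> atom label
    bonds: dict (i, j) -> bond label, each undirected bond listed once
    """
    adj = {v: {} for v in atoms}
    for (i, j), b in bonds.items():
        adj[i][j] = b
        adj[j][i] = b

    found = set()

    def extend(v, visited, label, n_bonds):
        found.add(label)
        if n_bonds == max_bonds:
            return
        for w, b in adj[v].items():
            if w not in visited:
                extend(w, visited | {w}, label + b + atoms[w], n_bonds + 1)

    for v in atoms:
        extend(v, {v}, atoms[v], 0)
    return found


# Hypothetical toy molecule: heavy atoms of acetamide, CH3-C(=O)-NH2.
atoms = {0: "C", 1: "C", 2: "O", 3: "N"}
bonds = {(0, 1): "s", (1, 2): "d", (1, 3): "s"}
fingerprint = labeled_paths(atoms, bonds, max_bonds=2)
```

Each path is recorded in both reading directions (for example `CsCdO` and `OdCsC`); keeping only the lexicographically smaller of the two would be one way to canonicalize.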



Circular fingerprints
Labeled sub-trees: Extended-Connectivity (or Circular) features

[Figure: molecular graph with a circular substructure highlighted: C{sC{sN|sC}|sN{sC}|sS{sC}}]

Example of a circular substructure of depth 2



2D spectrum kernels [Azencott et al., 2007]

Systematically extract paths / circular fingerprints, for various maximal depths
SVM with Tanimoto / MinMax



2D spectrum kernels [Azencott et al., 2007]

Mutagenicity (Mutag): 188 compounds

Benzodiazepine receptor affinity (BZR): 181 + 125 compounds

Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds

Estrogen receptor affinity (ER): 166 + 180 compounds

Data    SVM     Previous best
Mutag   90.4%   85.2% (gBoost)
BZR     79.8%   76.4%
COX2    70.1%   73.6%
ER      82.1%   79.8%

Informative patterns


Extract informative patterns while learning
All patterns + sparsity regularization
gBoost

gBoost


[Saigo et al., 2009]

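Train data: $\{(G_n, y_n)\}_{n=1 \dots l}$

Stump or hypothesis: $h(x; t, \omega) = \omega (2 x_t - 1)$
$x_t = 1$ if $t \subseteq G$ and 0 otherwise

Decision function:

$$f(x) = \sum_{t \in T,\, \omega \in \{-1, +1\}} \alpha_{t\omega}\, h(x; t, \omega)$$

Equivalent to solving (LP-Boost)

$$\min_{\lambda, \gamma} \gamma$$

such that
$\sum_{n=1}^{l} \lambda_n y_n h(x_n; t, \omega) \le \gamma \quad \forall t, \omega$
$\sum_{n=1}^{l} \lambda_n = 1$ and $0 \le \lambda_n \le D \quad \forall n$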

gBoost



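Solve by “column generation”

start with $H = \emptyset$ and $\lambda_n = 1/l \;\; \forall n$
Iteratively:

- find $(t^*, \omega^*)$ that maximizes $g(t, \omega) = \sum_{n=1}^{l} \lambda_n y_n h(x_n; t, \omega)$
- add $(t^*, \omega^*)$ to $H$
- update $\lambda_n$, $\gamma$

Stop when $\nexists (t^*, \omega^*)$ such that $g(t^*, \omega^*) > \gamma + \epsilon$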

gBoost



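Finding $(t^*, \omega^*)$: DFS code tree
Pruning condition ($g^*$ optimal gain so far):
if

$$\max \left\{ 2 \sum_{n: y_n = 1,\, t \subseteq G_n} \lambda_n - \sum_{n=1}^{l} y_n \lambda_n,\;\; 2 \sum_{n: y_n = -1,\, t \subseteq G_n} \lambda_n + \sum_{n=1}^{l} y_n \lambda_n \right\} < g^*$$

then $\forall t' : t \subseteq t', \forall \omega', \; g^* > g(t', \omega')$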
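The gain g(t, ω) and the pruning bound can be written down directly once pattern containment is known. The sketch below assumes containment has already been computed as a list of booleans (one per training graph); the subgraph search over the DFS code tree itself is not shown, and the function names and toy data are illustrative.

```python
def gain(contains, y, lam, omega):
    """g(t, omega) = sum_n lam_n * y_n * h(x_n; t, omega),
    with h = omega * (2 * x_t - 1) and x_t = 1 iff the pattern t is in G_n."""
    return sum(l * yn * omega * (2 * int(c) - 1)
               for c, yn, l in zip(contains, y, lam))


def gain_bound(contains, y, lam):
    """Upper bound on g(t', omega') over all supergraphs t' of t and both signs."""
    s = sum(yn * l for yn, l in zip(y, lam))
    pos = sum(l for c, yn, l in zip(contains, y, lam) if c and yn == 1)
    neg = sum(l for c, yn, l in zip(contains, y, lam) if c and yn == -1)
    return max(2 * pos - s, 2 * neg + s)


y = [1, 1, -1, -1]
lam = [0.25, 0.25, 0.25, 0.25]
contains_t = [True, True, False, True]        # pattern t
contains_super = [True, False, False, False]  # a supergraph t' of t

g_t = gain(contains_t, y, lam, omega=1)
bound_t = gain_bound(contains_t, y, lam)
g_super = gain(contains_super, y, lam, omega=1)
```

Because a supergraph t' can only occur in a subset of the graphs containing t, its gain never exceeds `bound_t`; when the bound falls below the best gain found so far, the whole subtree below t can be skipped.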

gBoost



Application to CPDB

Accuracy similar to Helma et al. (79%)

Most discriminative patterns

Weisfeiler-Lehman kernel


[Shervashidze et al., 2011]

Goal: scalability

Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges

→ sub-tree kernel
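A simplified sketch of the Weisfeiler-Lehman relabeling idea: at each round a vertex label is replaced by the label together with the sorted multiset of its neighbors' labels, and the kernel counts labels shared by two graphs. This version keeps the augmented labels as plain strings and ignores edge labels, so it does not reach the linear runtime of [Shervashidze et al., 2011], which compresses labels with a hash; all names are illustrative.

```python
from collections import Counter


def wl_features(labels, adj, iterations):
    """Count the vertex labels seen over several rounds of WL relabeling.

    labels: dict vertex -> label; adj: dict vertex -> list of neighbors.
    """
    feats = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # New label = old label + sorted multiset of the neighbors' old labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[w] for w in adj[v])) + ")"
            for v in adj
        }
        feats.update(current.values())
    return feats


def wl_kernel(g1, g2, iterations=2):
    """Dot product of the WL label-count vectors of two graphs."""
    f1 = wl_features(g1[0], g1[1], iterations)
    f2 = wl_features(g2[0], g2[1], iterations)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())


# Hypothetical toy graph: a chain C - C - O.
g = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
k_self = wl_kernel(g, g, iterations=1)
```

Each augmented label describes a rooted subtree pattern, which is why counting shared labels gives a sub-tree kernel.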


Convolution kernels


a.k.a. decomposition kernels
$(x_1, \dots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \dots, D$

$k_d \in \mathbb{R}^{X_d \times X_d}$: a Mercer kernel

$$K_{\text{decomposition}}(x, x') = \sum_{x_1 x_2 \dots x_D = x} \; \sum_{x'_1 x'_2 \dots x'_D = x'} k_1(x_1, x'_1)\, k_2(x_2, x'_2) \dots k_D(x_D, x'_D)$$

Spectrum kernels are a particular case of convolution kernels

Convolution kernels


Weighted Decomposition Kernel [Menchetti et al., 2005]

Match atoms and weigh them according to a kernel between sub-graphs that include these atoms

$$K_{\text{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a')\, K_c(\sigma, \sigma')$$

$r > 0 \in \mathbb{N}$

$D_r(x)$: decompositions of the molecular graph of $x$ in an atom $a$ and a subpath $\sigma$ of $x$ including $a$ and of depth at most $r$

$K_c$: contextual kernel, here: histogram intersection kernel

$$K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$$

$L$: possible labels for edges and vertices

$f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$.
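A minimal sketch of the histogram intersection kernel, assuming each subgraph is summarized by the list of its vertex and edge labels. The function name and the example label lists are assumptions for illustration.

```python
from collections import Counter


def histogram_intersection(labels_a, labels_b):
    """Sum over labels of the smaller of the two label frequencies."""
    fa, fb = Counter(labels_a), Counter(labels_b)
    return sum(min(fa[l], fb[l]) for l in fa.keys() & fb.keys())


# Atom labels (C, O, N) and bond labels (s, d) of two hypothetical subgraphs.
k_c = histogram_intersection(["C", "C", "O", "s", "d"], ["C", "N", "s", "s"])
```

Labels that are absent from either subgraph contribute 0, so it is enough to iterate over the labels the two subgraphs share.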

Optimal assignment kernels


Try to best map $x$ and $x'$
Not necessarily a kernel
In practice: $K \leftarrow K - \lambda_{\min} I$



The Local Atom Pair kernel [Hinselmann et al., 2010]

$M$: pairwise intramolecular matrix of inter-atomic topological distances

Local atom environment: $l(i) = \{(L(i), M_{ij}, L(j)),\; j \sim i\}$
$\kappa(i, j)$: dot product, Tanimoto or MinMax between $l(i)$ and $l(j)$



LAP: performance [Hinselmann et al., 2010]

Introducing spatial information


3D Histograms [Azencott et al., 2007]

Groups of k atoms

Associated size:

- Pairwise distances (k = 2)
- diameter of the smallest sphere that contains all k atoms



3D Histograms [Azencott et al., 2007]

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)

[Figure: molecular graph annotated with inter-atomic distances, and the resulting histogram of N-O distances (x-axis: N-O distance (Å), 0 to 10; y-axis: frequency of N-O, 0 to 4)]
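For k = 2, the histograms can be sketched as follows: every pair of atoms is assigned to a class by its two element symbols, and the pairwise distance is binned. The 1 Å bin width, the function name and the coordinates are assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def distance_histograms(atoms, bin_width=1.0):
    """atoms: list of (element, (x, y, z)).

    Returns {pair class: {bin index: count}}, e.g. class "N-O"."""
    hists = defaultdict(lambda: defaultdict(int))
    for (ea, pa), (eb, pb) in combinations(atoms, 2):
        pair_class = "-".join(sorted((ea, eb)))  # "N-O" regardless of order
        d = math.dist(pa, pb)                    # Euclidean distance
        hists[pair_class][int(d // bin_width)] += 1
    return hists


# Hypothetical coordinates in angstroms.
atoms = [("N", (0.0, 0.0, 0.0)), ("O", (2.5, 0.0, 0.0)), ("O", (0.0, 3.2, 0.0))]
hists = distance_histograms(atoms)
```

Two molecules can then be compared class by class, for instance by applying a Tanimoto or MinMax similarity to the concatenated histogram counts.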



3D Histograms: performance [Azencott et al., 2007]

Data        2D kernel   Hist3D kernel
Mutag       90.4%       88.8%
BZR (loo)   82.0%       79.4%
ER (loo)    87.0%       86.1%
COX2        76.9%       78.6%



3D Decomposition Kernels [Ceroni et al., 2007]

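Remember:

$$K_{\text{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a')\, K_c(\sigma, \sigma')$$

$$K_{\text{3DDK}}(x, x') = \sum_{\sigma \in S_r(x)} \; \sum_{\sigma' \in S_r(x')} K_s(\sigma, \sigma')$$

$S_r(x)$: subgraphs of $x$ composed of $r$ distinct vertices

$$K_s(\sigma, \sigma') = \prod_{i=1}^{r(r-1)/2} \delta(e_i, e'_i)\, e^{-\gamma (l_i - l'_i)^2}$$

$l_i$ = length of edge $e_i$ in $x$ ($e_1, e_2, \dots, e_{r(r-1)/2}$ lexicographically ordered); $\gamma \in \mathbb{R}$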



3DDK: Performance [Ceroni et al., 2007]

Data        2D kernel   Hist3D kernel   3DDK    Circ3DDK
Mutag       90.4%       88.8%           86.7%   83.5%
BZR (loo)   82.0%       79.4%           78.4%   81.4%
ER (loo)    87.0%       86.1%           82.3%   82.1%
COX2        76.9%       78.6%           75.6%   75.2%



The pharmacophore kernel [Mahé et al., 2006]

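pharmacophore $p \in P(x)$: $p = [(x_1, l_1), (x_2, l_2), (x_3, l_3)]$

$x_i$ = 3D coordinates of atom $i$ of $x$; $l_i$ = label of atom $i$

$$K(x, x') = \sum_{p \in P(x)} \; \sum_{p' \in P(x')} K_P(p, p')$$

$$K_P(p, p') = K_{\text{dist}}(d_1, d'_1)\, K_{\text{dist}}(d_2, d'_2)\, K_{\text{dist}}(d_3, d'_3)\, K_{\text{feat}}(l_1, l'_1)\, K_{\text{feat}}(l_2, l'_2)\, K_{\text{feat}}(l_3, l'_3)$$

$K_{\text{dist}}$: Gaussian RBF, $K_{\text{dist}}(d, d') = \exp\left(-\frac{\|d - d'\|^2}{2\sigma^2}\right)$
$K_{\text{feat}}$: Dirac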



3D LAP kernel [Hinselmann et al., 2010]

$M$: pairwise intramolecular matrix of inter-atomic geometric distances



Conclusion
How relevant is 3D information?
How good is 3D information?



Docking / Virtual High-Throughput Screening

High-throughput screening


Assay a large library of potential drugs against their target

Very costly

→ docking

→ virtual high-throughput screening (vHTS)

Measuring performance


Imbalanced data

Typically, most compounds are inactive ⇒ many more negative than positive examples

E.g. DHFR data set:
99,995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds

Accuracy is not appropriate:
predicting all compounds negative ⇒ accuracy = 99.8%

sensitivity = # True Positives / # Positives

specificity = # True Negatives / # Negatives

For many methods, the output is continuous ⇒ accuracy, sensitivity and specificity depend on a threshold θ
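The effect of class imbalance on accuracy can be reproduced with a short sketch. The data set below is synthetic (1,000 compounds, 2 of them active, all scored 0.0), and the function name and the convention "predicted active if score ≥ θ" are assumptions.

```python
def confusion_rates(labels, scores, theta):
    """labels: 1 for active, 0 for inactive.

    A compound is predicted active if its score is >= theta."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= theta)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < theta)
    pos = sum(labels)
    neg = len(labels) - pos
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / pos,
        "specificity": tn / neg,
    }


# Synthetic imbalanced data: 2 actives among 1,000 compounds.
labels = [1, 1] + [0] * 998
scores = [0.0] * 1000  # a classifier that calls every compound inactive
rates = confusion_rates(labels, scores, theta=0.5)
```

The all-negative classifier reaches a high accuracy and a perfect specificity while finding none of the actives, which is why sensitivity has to be reported alongside.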



Receiver Operating Characteristic (ROC) Curves

For all possible values of θ, report sensitivity and 1 − specificity
AUROC (Area under the ROC Curve) is a numerical measure of performance

AUROC(random) = 0.5 and AUROC(optimal) = 1

[Figure: ROC curves (x-axis: False Positive Rate, ticks 0 to 1 in steps of 1/6; y-axis: True Positive Rate, ticks 0 to 1 in steps of 1/4) for the random, perfect and real classifiers; the points of the real curve are annotated with the thresholds Inf, 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]

label   prediction
+       0.95
-       0.94
+       0.90
+       0.81
-       0.73
-       0.52
-       0.20
+       0.17
-       0.12
-       0.09
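The ROC curve of the ten labeled predictions listed above can be traced by sweeping the threshold over the sorted scores. The sketch below assumes there are no tied scores and integrates the curve with the trapezoidal rule; the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold
    one example at a time (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auroc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
area = auroc(roc_points(labels, scores))
```

For this example the area is 0.75 up to floating-point rounding: of the 4 × 6 = 24 (active, inactive) pairs, 18 are ranked correctly.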



Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

method    AUC
IRV       0.71
SVM       0.59
kNN       0.59
MAX-SIM   0.54

[Figure: ROC curves (x-axis: FPR, y-axis: TPR, both from 0.0 to 1.0) for RANDOM, IRV, SVM and MAXSIM]



Precision-recall curves

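Precision = # True Positives / # Predicted Positives

Recall = sensitivity

[Figure: precision-recall curves (x-axis: Recall, ticks 0 to 1 in steps of 1/4; y-axis: Precision, ticks 0 to 1 in steps of 1/5) for the perfect and real classifiers; the points of the real curve are annotated with the thresholds 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]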
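Precision-recall points can be computed from a ranked list in the same way as ROC points. The sketch below uses ten labeled predictions (four actives, six inactives) with the scores that annotate the curve; it assumes no tied scores, and the function name is illustrative.

```python
def pr_points(labels, scores):
    """Return (recall, precision) when the top k compounds are predicted
    positive, for k = 1 .. n (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    tp = 0
    points = []
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / pos, tp / k))
    return points


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
points = pr_points(labels, scores)
```

Unlike the ROC curve, the precision-recall curve does not use the number of true negatives, so it is less flattered by the large inactive class.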

Other applications


Other applications of graph mining in chemoinformatics

- Database indexing and search
- Prediction of 3D structures of small compounds and proteins
- Reaction prediction

References and further reading


[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of Chemical Information and Modeling 47, 965–974.

[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of Chemical Information and Modeling 47, 2098–2109.

[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23, 2038–2045.

[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Sciences 44, 1402–1411.

[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229.

[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of Chemical Information and Modeling 46, 2003–2014.

[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning, pp. 585–592, ACM, Bonn, Germany.

[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89.

[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.

The end


Tomorrow: Projects Presentations

By 9:45 AM on Friday, March 1, 2013, please submit the following by email to Prof. Borgwardt:

- A short report on your project that gives your answers to the questions in Section 2 of your exercise sheet (you can ignore Section 1 here)
- The code that you wrote as part of your project
- Your presentation slides as a PDF
