21
The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

Embed Size (px)

Citation preview

Page 1: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

The new JKlustor suite

Miklós Vargyas

Solutions for Cheminformatics

Page 2: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

2

Why do we cluster?

• to reduce the number of objects to deal with– group subsets together– represent each group by one member of it

Page 3: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

3

What’s the matter with that?

– tedious• parameter tuning in a trial-and-error fashion

– lack of interpretability• the algorithm does not provide explanation

– often do not meet chemists’ expectation

Page 4: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

4

Why is clustering molecules hard?

• lack of innate spatial arrangement– artificial arrangement

• infinite types of chemical spaces• various ‘distance metrics’• usually high dimensionality (hard to visualize)

– various approaches, no superior one• “best method” depends on application area, and on

actual data

Page 5: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

5

What do we need?

• no/few tuning

• easy to understand simple “explanation”

• novel approach– structure based clustering– Maximum Common Substructure– Molecular frameworks

Page 6: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

6

Maximum Common Substructure

• largest substructure shared by two molecules

• Simple concept! More “human”, visual.

• Yet hard (= expensive (= slow)) to compute.

Page 7: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

7

MCS complexity

• Sub-structure searching – query structure is known, it “only” have to be found

as part of the target structure (subgraph isomorphism)

– graph isomorphism is even “simpler” yet NP-hard• finding the answer can take long (scales exponentially

with respect to the number graph vertexes) in the worst case

• validating an answer is fast

• MCS – “query” structure is not known

– all possible substructures need to be checked• even the number of substructures is exponential!

Page 8: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

8

MCS algorithms

• two camps

backtracking clique detection

ad hoc high mathematical elegance

average complexity is better than worst case

average complexity is same as worst case

dynamic heuristics static (initial) heuristics

coloring is easy coloring is hard

fuzzy matching fussy matching

Page 9: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

9

MCS of a structure set

Page 10: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

10

LibraryMCS: Hierarchical MCS

Page 11: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

11

Intuitive visualization

Page 12: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

12

SAR table view

Page 13: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

13

R-group decomposition

Page 14: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

14

LibraryMCS scales linearly

-500

0

500

1000

1500

2000

2500

3000

3500

4000

0 5000 10000 15000 20000 25000 30000 35000

Structure count

Ru

nn

ing

tim

e (s

ec)

2006

2007

Linear (2007)

Page 15: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

15

Clustering performance comparison

0

10

20

30

40

50

60

70

80

90

0 20000 40000 60000 80000 100000 120000

Structure count

Run

ning

tim

e (m

in)

LibraryMCSJarvis-PatrickWard-Murtagh

Page 16: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

16

Behind performance

• MCS search– exhaustive– heuristics

• exact• inexact

• Predictive MCS coupling in clustering– all pairs are not feasible– rich fingerprinting

Page 17: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

17

Live demonstration

• Affect of use of heuristics– on average < 10% misclassifications– useful for obtaining birds-eye-view of a

larger/diverse sets

Page 18: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

1M< compounds libraries

• Molecular scaffolds, – Rings, ring systems– Bemis-Murcko frameworks

Page 19: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

• Sphere exclusion – Variants… linear scaling

Fast clustering methods

Page 20: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

20

Jklustor roadmap

• In the dev pipeline– IJC integration– Spotfire integration– new dynamic viewer

• Planned– disconnected MCS– multiple class members

Page 21: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics

21

Acknowledgements

• Gábor Imre

• Judit Vaskó-Szedlár

• Péter Vadász