David Wild – Research Overview, April 2006

Research Update, April 2006

David Wild

Assistant Professor of Chemical Informatics

Indiana University School of Informatics, Bloomington
djwild @ indiana.edu


Overview

• Smart mining of drug discovery information
  – Project goals
  – Workflow examples & demonstrations
  – Collaborations with scientists
  – Workflow interoperability

• Data mining of the DTP tumor cell line dataset

• Fast clustering of PubChem using Divisive K-means & Linux clusters

• Distributed Drug Discovery for neglected diseases

• Visualization & end-user layer tools
• Usability of chemical informatics tools
• Collaboration areas with Peter Murray-Rust's group


Smart mining of drug discovery information

• Technique for making the large volumes and diverse sources of chemical & related information manageable for scientists

• Observation: many information needs of scientists are straightforward, but complex and time-consuming in implementation

• This project aims to match information needs with use-cases and workflows of web services, along with imaginative human interfaces

• Supported by Microsoft eScience grant


3-layer model

• Interaction layer
  – Purpose: Interactive software for creative access and exploitation of information by humans
  – Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
• Aggregation layer
  – Purpose: Workflows and data schemas customized for particular domains, applications and users
  – Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
• Web service layer
  – Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
  – Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET


Architecture diagram (boxes and labels):

• Request from Human Interface
• AGENT / SMART CLIENT
  – Parse request
  – Select appropriate use cases and/or web service(s)
  – Schedule as necessary
• USE-CASE SCRIPT
  – Invoke New Structure Service
  – Convert structures to 3D
  – Dock results & protein file
  – Extract any hits
  – Return links for visualization
• Aggregate services
  – New Structure Service
      • Search online databases for recent structures
      • Search local databases for recent structures
      • Merge Results
• Atomic services (WSDL, SOAP; UDDI (?))
  – Online database (e.g. PubChem)
  – Local database
  – 3D Docking Tool
  – 2D-3D converter
  – 3D visualizer
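The use-case script in the diagram can be read as a small pipeline over atomic and aggregate services. The following is a minimal sketch of that flow, not the project's implementation: every function name, the `Structure` type, the "lower score is better" convention and the link format are hypothetical, and the services are passed in as plain callables rather than real SOAP/WSDL clients.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Structure:
    ident: str   # hypothetical identifier from the source database
    smiles: str  # structure as a SMILES string


def merge_results(*result_sets: List[Structure]) -> List[Structure]:
    """Merge result lists, keeping the first structure seen for each SMILES."""
    seen: Dict[str, Structure] = {}
    for results in result_sets:
        for s in results:
            seen.setdefault(s.smiles, s)
    return list(seen.values())


def new_structure_service(search_online: Callable, search_local: Callable) -> List[Structure]:
    """Aggregate service: query both sources, then merge."""
    return merge_results(search_online(), search_local())


def use_case_script(search_online, search_local, to_3d, dock,
                    protein_file: str, cutoff: float, link_for) -> List[str]:
    """Invoke the aggregate service, convert, dock, extract hits, return links."""
    hits = []
    for s in new_structure_service(search_online, search_local):
        score = dock(to_3d(s), protein_file)
        if score <= cutoff:  # assumption: lower docking score is better
            hits.append((s, score))
    hits.sort(key=lambda h: h[1])
    return [link_for(s) for s, _ in hits]


# Stand-in services with made-up data, in place of real web service calls.
made_up_scores = {"CCO": -7.5, "c1ccccc1": -3.0, "CCN": -6.1}
links = use_case_script(
    search_online=lambda: [Structure("pc-1", "CCO"), Structure("pc-2", "c1ccccc1")],
    search_local=lambda: [Structure("loc-9", "CCO"), Structure("loc-7", "CCN")],
    to_3d=lambda s: s,
    dock=lambda pose, protein: made_up_scores[pose.smiles],
    protein_file="target.pdb",
    cutoff=-5.0,
    link_for=lambda s: f"/view/{s.ident}",
)
```

Passing the services in as callables mirrors the layering in the diagram: the script only knows the order of steps, while the agent decides which concrete services to bind.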





Web services implemented

• Database Services
  – Local DTP Tumor Cell Line Database
  – PDB Ligand Database
  – Distributed Drug Discovery Database
• OpenEye
  – FRED Docking
  – FILTER Property Calculation and Filtering
  – OMEGA 2D-3D Conversion
• BCI
  – Various BCI Clustering services
• VOTables
• InChIGoogle
• InChiServer
• CMLRSSServer
• CDK Web services
• Open Babel


• A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex)
• A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex)
• The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures
• Similar structures are filtered for druggability, and are automatically passed to the OpenEye FRED docking program for docking into the target protein
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet
• Correlation of docking results and “biological fingerprints” across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds
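The similarity search step in this workflow can be illustrated with a small fingerprint comparison. This is a sketch under stated assumptions: the slides do not name the similarity measure, so the Tanimoto coefficient is assumed here, fingerprints are modeled as sets of on-bit positions, and the compound IDs and bits are made up.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_structures(query: FrozenSet[int],
                       library: Dict[str, FrozenSet[int]],
                       threshold: float) -> List[Tuple[str, float]]:
    """Return (id, similarity) pairs at or above the threshold, best first."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in library.items()]
    hits = [(cid, sim) for cid, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Made-up fingerprints standing in for the ligand and the compound table.
ligand = frozenset({1, 2, 3, 4})
library = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3}),
    "C": frozenset({1, 9}),
}
hits = similar_structures(ligand, library, threshold=0.7)
```

In the workflow described above, the hits from a step like this would be the structures handed on to filtering and docking.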


Workflow interoperability

• Taverna SCUFL <-> BPEL conversion
  – Working with Beth Plale & Dennis Gannon at IU Computer Science
• Use of developing data standards for Chemical Informatics
  – CML & InChI
  – XML meta data
• Interoperability of Taverna with other workflow systems
• Use of workflows in experiment execution environments
  – See http://www.extreme.indiana.edu/portals/index.shtml


DTP Tumor Cell Line Data Mining

• Collaboration with Melanie Wu, Database & Data Mining expert at the School of Informatics

• Local PostgreSQL database exposed as a web service
• Building on existing published data mining research on this dataset
• Current projects:
  – Comparing compound clusterings based on structure (MACCS keys) and “bioprint” (vector of screening results)
  – Investigating fingerprint and bioprint correlations with MOAs of ~100 compounds (a correlation is clearly observed)
  – Application of workflows to associate docking results with screening results
  – Collaboration with Dr. Faming Zhang at IU Department of Chemistry for mining of kinase-related information
• Next projects:
  – Correlation of structural and gene expression information (without naïve combination of screen & gene information)
  – Application of COMPARE
  – Integration into a wider oncology information system
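Comparing two "bioprints" amounts to comparing two vectors of screening results across the same cell lines. The sketch below uses the Pearson correlation coefficient, which is the measure usually associated with COMPARE-style pattern matching; whether the projects listed above use exactly this measure is an assumption, and the values are made up. Cell lines with a missing result in either vector are skipped.

```python
import math
from typing import Optional, Sequence


def bioprint_correlation(x: Sequence[Optional[float]],
                         y: Sequence[Optional[float]]) -> Optional[float]:
    """Pearson correlation over cell lines where both compounds have a value.

    Returns None when fewer than two shared values exist or when either
    vector is constant over the shared cell lines.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Made-up screening vectors; None marks a cell line with no result.
compound_a = [1.0, 2.0, 3.0, None, 5.0]
compound_b = [2.0, 4.0, 6.0, 8.0, 10.0]
r = bioprint_correlation(compound_a, compound_b)
```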


Database architecture

• Using PostgreSQL database with gNova CHORD for structure & fingerprint searching, exposed as a web service

• Compound table contains ~200,000 compounds with SMILES, ID, properties and MACCS keys

• Screen tables contain GI50/LD50/TGI values; a gene expression table is in development

• Can search on mix of structure and numeric / categorical data

• Active research into optimizing searching efficiency
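A query that mixes structure and numeric criteria might look like the sketch below. It is illustrative only: it uses an in-memory SQLite database instead of PostgreSQL, a Python function registered as `tanimoto` instead of gNova CHORD's own operators (which are not shown here), and a hypothetical two-table schema with made-up fingerprints (hex strings) and screening values.

```python
import sqlite3


def tanimoto_hex(a: str, b: str) -> float:
    """Tanimoto coefficient between two fingerprints stored as hex strings."""
    x, y = int(a, 16), int(b, 16)
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name "tanimoto".
conn.create_function("tanimoto", 2, tanimoto_hex)
conn.executescript("""
CREATE TABLE compound (id INTEGER PRIMARY KEY, smiles TEXT, fp TEXT);
CREATE TABLE screen (compound_id INTEGER, cell_line TEXT, gi50 REAL);
INSERT INTO compound VALUES (1, 'CCO', 'f0'), (2, 'CCN', 'e0'), (3, 'c1ccccc1', '0f');
INSERT INTO screen VALUES (1, 'MCF7', -6.2), (2, 'MCF7', -4.1), (3, 'MCF7', -7.0);
""")

# Structure criterion (similarity to a query fingerprint) combined with
# a categorical criterion (cell line) and a numeric one (screening value).
rows = conn.execute(
    """
    SELECT c.id, c.smiles, s.gi50
    FROM compound c JOIN screen s ON s.compound_id = c.id
    WHERE tanimoto(c.fp, ?) >= ? AND s.cell_line = ? AND s.gi50 <= ?
    ORDER BY c.id
    """,
    ("f0", 0.7, "MCF7", -4.0),
).fetchall()
```

A function evaluated row by row like this scans the whole compound table, which is one reason search efficiency is worth optimizing at the ~200,000 compound scale.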


Cluster Analysis and Chemical Informatics

• Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
• Organizational usage has not been as well studied as the other two, but see
  – Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward’s Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
• Essentially helping large datasets become manageable
• Methods used:
  – Jarvis-Patrick and variants
      • O(N²), single partition
  – Ward’s method
      • Hierarchical, regarded as best, but at least O(N²)
  – K-means
      • < O(N²), requires a set number of clusters, a little “messy”
  – Sphere-exclusion (Butina)
      • Fast, simple, similar to JP
  – Kohonen network
      • Clusters arranged in 2D grid, ideal for visualization
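Of the methods listed, sphere-exclusion is the simplest to show in a few lines. The sketch below follows the commonly described Butina procedure (build neighbor lists at a similarity threshold, repeatedly take the unassigned compound with the most neighbors as a centroid, and exclude its neighbors); the Tanimoto measure, the set-of-bits fingerprints and the data are assumptions for illustration, and the neighbor lists are built by naive all-pairs comparison.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    return shared / total if total else 0.0


def sphere_exclusion(fps: Dict[str, FrozenSet[int]], threshold: float) -> List[List[str]]:
    """Cluster by sphere exclusion; each cluster lists its centroid first."""
    ids = list(fps)
    # Neighbor list: every other compound at or above the similarity threshold.
    neighbors = {
        i: [j for j in ids if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in ids
    }
    # Visit candidate centroids from most to fewest neighbors (ties by id).
    order = sorted(ids, key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + [j for j in neighbors[centroid] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Made-up fingerprints for four compounds.
fps = {
    "a": frozenset({1, 2, 3, 4}),
    "b": frozenset({1, 2, 3, 5}),
    "c": frozenset({1, 2, 3, 4, 5}),
    "d": frozenset({7, 8, 9}),
}
clusters = sphere_exclusion(fps, threshold=0.7)
```

The result is a single partition with no hierarchy, which is the property it shares with Jarvis-Patrick.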


Limitations of Ward’s for large datasets (>1m)

• Best algorithms have O(N²) time requirement (RNN)
• Requires random access to fingerprints
  – hence substantial memory requirements (O(N))
• Problem of selection of best partition
  – can select desired number of clusters
• Easily hit 4GB memory addressing limit on 32 bit machines
  – Approximately 2m compounds
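The two scaling limits above can be put into rough numbers. This is back-of-envelope arithmetic, not a measurement: the per-compound figure is derived from the "4GB at approximately 2m compounds" statement on the assumption that the limit is consumed entirely by per-compound storage, and the cost ratio simply applies the O(N²) growth rate.

```python
def bytes_per_item(limit_bytes: int, n_items: int) -> float:
    """Memory budget per item when n_items exhaust limit_bytes (O(N) storage)."""
    return limit_bytes / n_items


def quadratic_cost_ratio(n_small: int, n_large: int) -> float:
    """How much more work an O(N^2) method needs when N grows."""
    return (n_large / n_small) ** 2


# 4 GB addressing limit reached at roughly 2 million compounds.
budget = bytes_per_item(4 * 2**30, 2_000_000)

# Growing a dataset from 100,000 to 1,000,000 compounds under O(N^2) time.
ratio = quadratic_cost_ratio(100_000, 1_000_000)
```

Under these assumptions the budget works out to about 2 KB per compound, and a tenfold larger dataset costs about a hundred times more time.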


Divisive K-means Clustering

• New hierarchical divisive method
  – Hierarchy built from top down, instead of bottom up
  – Divide complete dataset into two clusters
  – Continue dividing until all items are singletons
  – Each binary division done using K-means method
  – Originally proposed for document clustering
• “Bisecting K-means”
  – Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
  – Found to be more effective than agglomerative methods
  – Forms more uniformly-sized clusters at given level
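The top-down procedure described above can be sketched in plain Python. This is a minimal illustration of the bisecting idea, not BCI's Divkmeans: the deterministic seeding (first point plus the point farthest from it), squared Euclidean distance, batch centroid updates, the `min_size` stopping parameter and the nested-tuple hierarchy are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points: List[Point], max_iter: int = 50) -> Tuple[List[Point], List[Point]]:
    """One binary division: K-means with k = 2 and batch centroid updates."""
    seed_a = points[0]
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    centers = [list(seed_a), list(seed_b)]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:  # termination: assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    left = [p for p, lab in zip(points, labels) if lab == 0]
    right = [p for p, lab in zip(points, labels) if lab == 1]
    return left, right


def bisecting_kmeans(points: List[Point], min_size: int = 1):
    """Build the hierarchy top down.

    A leaf is a list of points; an internal node is a (left, right) tuple.
    Clusters with min_size or fewer points are not divided further.
    """
    if len(points) <= max(min_size, 1):
        return list(points)
    left, right = two_means(points)
    if not left or not right:  # degenerate split (e.g. duplicates): peel one off
        left, right = points[:1], points[1:]
    return (bisecting_kmeans(left, min_size), bisecting_kmeans(right, min_size))


points = [[0.0], [0.1], [5.0], [5.2]]
tree = bisecting_kmeans(points)
coarse_tree = bisecting_kmeans(points, min_size=2)
```

With `min_size=1` the division continues until every item is a singleton; a larger value stops earlier and leaves small clusters as leaves.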


BCI Divkmeans

• Several options for detailed operation
  – Selection of next cluster for division
      • size, variance, diameter
      • affects selection of partitions from hierarchy, not shape of hierarchy
• Options within each K-means division step
  – distance measure
  – choice of seeds
  – batch-mode or continuous update of centroids
  – termination criterion

• Have developed MPI parallel version for Linux clusters / grids in conjunction with BCI (now Digital Chemistry)

• For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm

• Now available as a web service at IU (along with other BCI programs)
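The "selection of next cluster for division" option can be sketched as a pluggable criterion. The definitions below (mean squared distance to the centroid for variance, largest pairwise Euclidean distance for diameter) are common textbook choices and are assumptions here; the exact definitions used by BCI Divkmeans are not given in the slides.

```python
import math
from typing import Callable, List, Sequence

Cluster = List[Sequence[float]]


def size(cluster: Cluster) -> float:
    return float(len(cluster))


def variance(cluster: Cluster) -> float:
    """Mean squared Euclidean distance of members to the cluster centroid."""
    n = len(cluster)
    center = [sum(col) / n for col in zip(*cluster)]
    return sum(math.dist(p, center) ** 2 for p in cluster) / n


def diameter(cluster: Cluster) -> float:
    """Largest pairwise Euclidean distance (0.0 for a singleton)."""
    return max(
        (math.dist(p, q) for i, p in enumerate(cluster) for q in cluster[i + 1:]),
        default=0.0,
    )


def next_to_divide(clusters: List[Cluster], criterion: Callable[[Cluster], float]) -> int:
    """Index of the cluster with the largest criterion value."""
    return max(range(len(clusters)), key=lambda i: criterion(clusters[i]))


# Two made-up clusters: one larger but tight, one smaller but spread out.
clusters = [
    [[0.0], [0.1], [0.2]],
    [[0.0], [10.0]],
]
by_size = next_to_divide(clusters, size)
by_variance = next_to_divide(clusters, variance)
by_diameter = next_to_divide(clusters, diameter)
```

Because every cluster is eventually divided, the criterion changes the order of divisions, and therefore which partition exists at a given number of clusters, rather than the final tree.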


Comparative execution times

[Figure: execution time vs. dataset size. x-axis: Number of Structures in Clustered Set (0 to 120,000); y-axis: Execution Time (s) (0 to 30,000). Series: Wards, K-means, Divisive K-means, Parallel Divisive K-means (4-node). Data labels on the chart: 7h 27m, 3h 06m, 2h 25m, 44m.]

NCI subsets, 2.2 GHz Intel Celeron processor


MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem compound, Nov 2005)

[Figure: runtime vs. number of processors. x-axis: Number of processors (0 to 90); y-axis: Runtime (labeled in seconds; the 250 to 700 tick range matches the wall_mins column below). Series: Minsize 1, Minsize 100, Minsize 1000.]

min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
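The scaling behavior in these timings is easier to see as speedup and parallel efficiency. The sketch below takes the wall-clock minutes for the min_size 1 runs from the figures above and, since no single-processor run is reported, measures everything relative to the 20-processor run; that choice of baseline is an assumption of this example.

```python
from typing import Dict, Tuple

# wall_mins for min_size = 1, keyed by number of processors.
wall_mins = {20: 676, 40: 444, 60: 379, 80: 353}


def relative_scaling(times: Dict[int, float], base: int) -> Dict[int, Tuple[float, float]]:
    """Speedup and efficiency of each run relative to the run on `base` CPUs.

    speedup    = T(base) / T(n)
    efficiency = speedup / (n / base)
    """
    t0 = times[base]
    result = {}
    for n, t in sorted(times.items()):
        speedup = t0 / t
        result[n] = (speedup, speedup / (n / base))
    return result


scaling = relative_scaling(wall_mins, base=20)
for n, (speedup, efficiency) in scaling.items():
    print(f"{n:3d} CPUs: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Quadrupling the processor count from 20 to 80 gives slightly less than a twofold speedup for this setting, so most of the gain is already realized by 40 to 60 processors.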


Distributed Drug Discovery

• Project run by Dr. Bill Scott at IUPUI
• Tackling neglected diseases using distributed chemistry (while educating undergraduates about combinatorial chemistry)
• Each student makes 4 compounds on cheap equipment. Each class will typically make around 60 compounds. Many universities participating around the world
• Reaction transformations, virtual and made compounds stored in PostgreSQL database exposed as a web service
• This information can then be drawn into our workflows. For example, searches for similar compounds can be done on PubChem, Tumor Cell Line database, etc.


Distributed Drug Discovery

William L. Scott: A Distributed Drug Discovery Concept to Search for Developing World Disease Drug Leads


Visualization and end-user tools

• PubChemSR
• 2D structure visualizer using CDK
• VoPlot
• VisualiSAR - modal fingerprints
• Similarity Matrix Visualization
• General approaches to end user tools
  – Portlets and .NET
  – Usability & Contextual Design


PubChemSR (Junguk Hur)

http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR


Simple 2D viewer applet (using CDK) - David Jiao


VoPlot


VisualiSAR - modal fingerprints

With a nod to Edward Tufte. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html


Visual Similarity Matrices

Student: Christopher Mueller

Data: NCI Compound Database - Compounds with positive AIDS screens

[Figure panels, one per vertex ordering: Original (curated), Breadth-first Search, Degree, Sloan’s Algorithm]

Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.

Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.

In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
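The effect of vertex ordering on a similarity matrix can be shown on a toy graph. The sketch below is an illustration only, not the project's library: the graph is made up, the diagonal is drawn as a relation of each item with itself (an assumption), and only the breadth-first ordering is implemented.

```python
from collections import deque
from typing import Dict, List, Set


def bfs_order(adj: Dict[str, Set[str]], start: str) -> List[str]:
    """Breadth-first vertex ordering; unreachable vertices are appended
    by restarting the search from each remaining vertex in sorted order."""
    order: List[str] = []
    seen: Set[str] = set()
    for root in [start] + sorted(adj):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order


def dot_matrix(adj: Dict[str, Set[str]], order: List[str]) -> List[str]:
    """Render the matrix as text rows: '#' marks a relation, '.' none."""
    return [
        "".join("#" if (u == v or v in adj[u]) else "." for v in order)
        for u in order
    ]


# A made-up similarity graph: a chain a - c - d - b.
adj = {"a": {"c"}, "b": {"d"}, "c": {"a", "d"}, "d": {"b", "c"}}

alphabetical = dot_matrix(adj, sorted(adj))
reordered = dot_matrix(adj, bfs_order(adj, "a"))
print("\n".join(reordered))
```

In the alphabetical ordering the dots are scattered; in the breadth-first ordering they collect along the diagonal, which makes the chain structure visible at a glance.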


General approaches to end-user tools

• Main interface-level vehicle should be portlets, allowing reuse and interchangeability
• Other interfaces, such as .NET clients, email and RSS interfaces will also be investigated
• No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
• Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
• Possibility of multiple interfaces for different groups of people (Cooper’s “primary personas”)
• Don’t assume the browser interface – email / NLP?
• Start with the basics
  – 2D chemical structure drawing (input)
  – Visualization of large numbers of chemical structures in 2D
  – 3D chemical structure visualization
• Current project is looking at usability of online chemical databases (including PubChem)


Usability of 2D structure drawing tools

• Key difference between “sequential” and “random” drawers
• Huge difference in intuitiveness
• Key factor: how badly you can mess things up
• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw


Cambridge-Indiana Collaboration

• Weekly Access Grid meetings
• Bringing together areas of expertise in the UK and USA
• Applying OSCAR text mining to NIH data
• Looking toward joint presentations & publications



Contributors

• My students
  – Xiao Dong
  – Huijung Wang
  – Jason Lee
  – Junguk Hur
  – David Jiao
  – Usha Cheemakurthi
  – Waiping Kam
• Geoffrey’s group at CGL
  – Marlon Pierce
  – Jake Kim
  – Sima Patel
  – Smitha Ajay
• Others
  – Gary Wiggins
  – Melanie Wu
  – Dennis Gannon
  – Beth Plale
  – Rajarshi Guha
  – Peter Murray-Rust
  – Peter Corbett
  – Dan Zaharevitz
