David Wild – Research Overview April 2006. Page 1 Indiana University School of
Research Update, April 2006
David Wild
Assistant Professor of Chemical Informatics
Indiana University School of Informatics, Bloomington
djwild @ indiana.edu
Overview
• Smart mining of drug discovery information
– Project goals
– Workflow examples & demonstrations
– Collaborations with scientists
– Workflow interoperability
• Data mining of the DTP tumor cell line dataset
• Fast clustering of PubChem using Divisive K-means & Linux clusters
• Distributed Drug Discovery for neglected diseases
• Visualization & end-user layer tools
• Usability of chemical informatics tools
• Collaboration areas with the Peter Murray-Rust group
Smart mining of drug discovery information
• Technique for making the large volumes and diverse sources of chemical & related information manageable for scientists
• Observation: many information needs of scientists are straightforward to state, but complex and time-consuming to implement
• This project aims to match information needs with use-cases and workflows of web services, along with imaginative human interfaces
• Supported by Microsoft eScience grant
3-layer model
• Interaction Layer
– Purpose: interactive software for creative access and exploitation of information by humans
– Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
• Aggregation Layer
– Purpose: workflows and data schemas customized for particular domains, applications and users
– Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
• Web service layer
– Purpose: comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
– Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
[Figure: architecture diagram. Recoverable labels are listed below.]
• Request from Human Interface
• AGENT / SMART CLIENT
– Parse request
– Select appropriate use cases and/or web service(s)
– Schedule as necessary
• USE-CASE SCRIPT
– Invoke New Structure Service
– Convert structures to 3D
– Dock results & protein file
– Extract any hits
– Return links for visualization
• Aggregate services (WSDL, SOAP): New Structure Service
– Search online databases for recent structures
– Search local databases for recent structures
– Merge results
• Atomic services (WSDL, SOAP)
– Online database (e.g. PubChem)
– Local database
– 2D-3D converter
– 3D docking tool
– 3D visualizer
• UDDI (?)
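The use-case script on this slide can be read as a small orchestration over atomic and aggregate services. The sketch below is only an illustration of that idea: every function is a hypothetical local stand-in for what would be a SOAP call, and the data, file name, link format and the "docking score" are invented toy values.

```python
from typing import List

# Hypothetical stand-ins for atomic services; real versions would be web service calls.
def search_online(days: int) -> List[str]:
    return ["CCO", "c1ccccc1"]

def search_local(days: int) -> List[str]:
    return ["c1ccccc1", "CC(=O)O"]

def convert_to_3d(smiles: str) -> str:
    return f"3d:{smiles}"

def dock(structure_3d: str, protein_file: str) -> float:
    # Toy score (longer label = lower score); not a real docking calculation.
    return -float(len(structure_3d))

def new_structure_service(days: int) -> List[str]:
    """Aggregate service: search both sources and merge, dropping duplicates."""
    merged: List[str] = []
    for smiles in search_online(days) + search_local(days):
        if smiles not in merged:
            merged.append(smiles)
    return merged

def use_case_script(protein_file: str, days: int = 7, cutoff: float = -10.0) -> List[str]:
    """Find new structures, convert to 3D, dock, keep hits, return viewer links."""
    links = []
    for smiles in new_structure_service(days):
        score = dock(convert_to_3d(smiles), protein_file)
        if score <= cutoff:  # assumed convention: lower score is better
            links.append(f"viewer?structure={smiles}")
    return links

links = use_case_script("target.pdb")
print(links)
```

Keeping the aggregate service as an ordinary function over the atomic ones mirrors the split between atomic and aggregate services in the diagram.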
Web services implemented
• Database Services
– Local DTP Tumor Cell Line Database
– PDB Ligand Database
– Distributed Drug Discovery Database
• OpenEye
– FRED Docking
– FILTER Property Calculation and Filtering
– OMEGA 2D-3D Conversion
• BCI
– Various BCI Clustering services
• VOTables
• InChIGoogle
• InChiServer
• CMLRSSServer
• CDK Web services
• Open Babel
• A protein implicated in tumor growth is supplied to the docking program (in this case HSP90, taken from the PDB 1Y4 complex).
• A 2D structure is supplied as input to the similarity search (in this case, the bound ligand extracted from the PDB 1Y4 complex).
• The workflow uses our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures.
• Similar structures are filtered for druggability and automatically passed to the OpenEye FRED docking program for docking into the target protein.
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
• Correlation of docking results and “biological fingerprints” across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds.
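The similarity-search step in this workflow can be illustrated with a minimal sketch. The slides do not say which similarity measure the database service uses; the Tanimoto coefficient on fingerprint bit sets is assumed here because it is a common choice, and the compound IDs, fingerprints and threshold are invented.

```python
from typing import Dict, FrozenSet, List, Tuple

def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_search(query: FrozenSet[int],
                      library: Dict[str, FrozenSet[int]],
                      threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (compound id, similarity) pairs at or above the threshold, best first."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in library.items()]
    hits = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy library; bit positions are arbitrary.
library = {
    "cmpd_A": frozenset({1, 2, 3, 4}),
    "cmpd_B": frozenset({1, 2, 9}),
    "cmpd_C": frozenset({7, 8}),
}
results = similarity_search(frozenset({1, 2, 3}), library)
print(results)
```

The hits returned by such a search would then be the input to the filtering and docking steps.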
Workflow interoperability
• Taverna SCUFL <-> BPEL conversion
– Working with Beth Plale & Dennis Gannon at IU Computer Science
• Use of developing data standards for Chemical Informatics
– CML & InChI
– XML metadata
• Interoperability of Taverna with other workflow systems
• Use of workflows in experiment execution environments
– See http://www.extreme.indiana.edu/portals/index.shtml
DTP Tumor Cell Line Data Mining
• Collaboration with Melanie Wu, Database & Data Mining expert at the School of Informatics
• Local PostgreSQL database exposed as a web service
• Building on existing published data mining research on this dataset
• Current projects:
– Comparing compound clusterings based on structure (MACCS keys) and “bioprint” (vector of screening results)
– Investigating fingerprint and bioprint correlations with MOAs of ~100 compounds (correlation is definitely found)
– Application of workflows to associate docking results with screening results
– Collaboration with Dr. Faming Zhang at IU Department of Chemistry for mining of Kinase-related information
• Next projects:
– Correlation of structural and gene expression information (without naïve combination of screen & gene information)
– Application of COMPARE
– Integration into a wider oncology information system
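One way to compare a structure-based clustering with a bioprint-based clustering of the same compounds is a pair-counting agreement score. The slides do not name the comparison measure used in the project; the Rand index below is an assumed, standard choice, and the cluster labels are invented.

```python
from itertools import combinations
from typing import Sequence

def rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of compound pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same compounds")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Toy labels for five compounds (hypothetical data).
structure_clusters = [0, 0, 1, 1, 2]   # e.g. from MACCS-key clustering
bioprint_clusters = [0, 0, 1, 2, 2]    # e.g. from screening-result vectors
score = rand_index(structure_clusters, bioprint_clusters)
print(score)
```

A score of 1.0 means the two clusterings group every pair of compounds identically; lower values indicate that structure and bioprint disagree on more pairs.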
Database architecture
• Using PostgreSQL database with gNova CHORD for structure & fingerprint searching, exposed as a web service
• Compound table contains ~200,000 compounds with SMILES, ID, properties and MACCS keys
• Screen tables contain GI50/LD50/TGI values; a gene expression table is in development
• Can search on mix of structure and numeric / categorical data
• Active research into optimizing searching efficiency
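A query mixing a structural criterion with a numeric one can be sketched as a single SQL statement. The example below is a stand-in only: it uses SQLite from the Python standard library rather than PostgreSQL with gNova CHORD, replaces structure searching with a simple bitmask test on a toy integer fingerprint, and the table names, columns and values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (id INTEGER PRIMARY KEY, smiles TEXT, fp INTEGER);
CREATE TABLE screen (compound_id INTEGER, cell_line TEXT, gi50 REAL);
""")
# Toy data: fp is a 4-bit stand-in for a real key-based fingerprint.
conn.executemany("INSERT INTO compound VALUES (?, ?, ?)", [
    (1, "c1ccccc1O", 0b1011),
    (2, "CCO", 0b0010),
    (3, "c1ccccc1N", 0b1001),
])
conn.executemany("INSERT INTO screen VALUES (?, ?, ?)", [
    (1, "MCF7", 5.2),
    (2, "MCF7", 7.9),
    (3, "MCF7", 4.1),
])

# Structural criterion: all query bits set in the compound fingerprint.
# Numeric / categorical criteria: cell line and a threshold on the screening value.
rows = conn.execute(
    """
    SELECT c.id, c.smiles, s.gi50
    FROM compound AS c
    JOIN screen AS s ON s.compound_id = c.id
    WHERE (c.fp & :bits) = :bits
      AND s.cell_line = :line
      AND s.gi50 >= :min_value
    ORDER BY c.id
    """,
    {"bits": 0b1001, "line": "MCF7", "min_value": 5.0},
).fetchall()
print(rows)
```

Evaluating both kinds of criteria in one statement lets the database decide which filter to apply first, which is where search-efficiency tuning would come in.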
Cluster Analysis and Chemical Informatics
• Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
• Organizational usage has not been as well studied as the other two, but see
– Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward's Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
• Essentially helping large datasets become manageable
• Methods used:
– Jarvis-Patrick and variants: O(N²), single partition
– Ward’s method: hierarchical, regarded as best, but at least O(N²)
– K-means: < O(N²), requires a preset number of clusters, a little “messy”
– Sphere-exclusion (Butina): fast, simple, similar to JP
– Kohonen network: clusters arranged in 2D grid, ideal for visualization
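As an illustration of the sphere-exclusion idea listed above, here is a minimal sketch. It is a simplified variant, not the published Butina procedure in detail: it re-counts neighbors among the still-unassigned compounds at every step, assumes Tanimoto similarity on bit-set fingerprints, and uses invented data and an arbitrary threshold.

```python
from typing import List, Set

def tanimoto(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sphere_exclusion(fps: List[Set[int]], threshold: float) -> List[List[int]]:
    """Repeatedly pick the unassigned compound with the most unassigned
    neighbors, make it a cluster center, and exclude its sphere."""
    n = len(fps)
    # O(N^2) neighbor lists: j is a neighbor of i if similarity >= threshold.
    neighbors = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters: List[List[int]] = []
    while unassigned:
        # Ties are broken by lowest index because max() keeps the first maximum.
        center = max(sorted(unassigned), key=lambda i: len(neighbors[i] & unassigned))
        members = [center] + sorted(neighbors[center] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters

fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {7, 8}, {7, 8, 9}]
clusters = sphere_exclusion(fps, threshold=0.6)
print(clusters)
```

Each cluster is reported with its center first, followed by the members that fell inside its similarity sphere.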
Limitations of Ward’s for large datasets (>1m)
• Best algorithms have O(N²) time requirement (RNN, reciprocal nearest neighbors)
• Requires random access to fingerprints
– hence substantial memory requirements (O(N))
• Problem of selection of best partition
– can select desired number of clusters
• Easily hit 4GB memory addressing limit on 32-bit machines
– Approximately 2m compounds
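The two figures on this slide (a 4GB limit reached at roughly 2 million compounds) imply a per-compound memory budget. The short calculation below derives it, assuming memory use grows linearly with the number of compounds and that the whole address space is available to fingerprint data; neither assumption is stated in the slides.

```python
# Figures taken from the slide.
ADDRESS_LIMIT_BYTES = 4 * 2**30      # 4GB addressing limit on a 32-bit machine
COMPOUNDS_AT_LIMIT = 2_000_000       # approximate dataset size that hits the limit

# Implied average memory per compound under the linear-scaling assumption.
bytes_per_compound = ADDRESS_LIMIT_BYTES / COMPOUNDS_AT_LIMIT
print(f"about {bytes_per_compound:.0f} bytes (~{bytes_per_compound / 1024:.1f} KB) per compound")
```

At roughly 2 KB per compound, memory use is linear in N, but the constant is large enough that datasets of a few million structures do not fit in a 32-bit address space.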
Divisive K-means Clustering
• New hierarchical divisive method
– Hierarchy built from top down, instead of bottom up
– Divide complete dataset into two clusters
– Continue dividing until all items are singletons
– Each binary division done using K-means method
– Originally proposed for document clustering
• “Bisecting K-means”
– Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
– Found to be more effective than agglomerative methods
– Forms more uniformly-sized clusters at given level
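The top-down procedure described above can be sketched in a few lines. This is a toy version, not the BCI Divkmeans implementation: it works on small numeric points rather than fingerprints, always divides the largest cluster next, picks seeds at random, and stops at a requested number of clusters instead of continuing down to singletons.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def centroid(points: Sequence[Point]) -> Point:
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sq_dist(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points: List[Point], rng: random.Random, max_iter: int = 20):
    """One binary division: plain K-means with K = 2."""
    seeds = rng.sample(points, 2)
    left: List[Point] = []
    right: List[Point] = []
    for _ in range(max_iter):
        left = [p for p in points if sq_dist(p, seeds[0]) <= sq_dist(p, seeds[1])]
        right = [p for p in points if sq_dist(p, seeds[0]) > sq_dist(p, seeds[1])]
        if not left or not right:
            break
        new_seeds = [centroid(left), centroid(right)]
        if new_seeds == seeds:  # assignments are stable
            break
        seeds = new_seeds
    return left, right

def bisecting_kmeans(points: List[Point], n_clusters: int, seed: int = 0) -> List[List[Point]]:
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        clusters.sort(key=len, reverse=True)   # size criterion: divide the largest
        target = clusters.pop(0)
        if len(target) < 2:                    # only singletons remain
            clusters.append(target)
            break
        left, right = two_means(target, rng)
        if not left or not right:              # degenerate split, e.g. duplicate points
            half = len(target) // 2
            left, right = target[:half], target[half:]
        clusters += [left, right]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = bisecting_kmeans(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Running the loop until every cluster is a singleton would produce the full hierarchy; recording each split as it happens gives the tree from the top down.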
BCI Divkmeans
• Several options for detailed operation
– Selection of next cluster for division: size, variance, diameter
– Affects selection of partitions from hierarchy, not shape of hierarchy
• Options within each K-means division step
– distance measure
– choice of seeds
– batch-mode or continuous update of centroids
– termination criterion
• Have developed MPI parallel version for Linux clusters / grids in conjunction with BCI (now Digital Chemistry)
• For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm
• Now available as a web service at IU (along with other BCI programs)
Comparative execution times
[Figure: execution time (s, 0 to 30,000) versus number of structures in the clustered set (0 to 120,000) for Wards, K-means, Divisive K-means and Parallel Divisive K-means (4-node). Data labels on the chart: 7h 27m, 3h 06m, 2h 25m, 44m. NCI subsets, 2.2 GHz Intel Celeron processor.]
MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem Compound, Nov 2005)
[Figure: runtime versus number of processors (0 to 90) for Minsize 1, Minsize 100 and Minsize 1000. The y-axis is labeled "Runtime (seconds)" with ticks from 250 to 700.]

min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
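The timings above can be turned into relative speedup and parallel efficiency, taking the 20-processor run as the baseline because no single-processor time is reported. The sketch below uses only the wall_mins values from the table; the function names are my own.

```python
from typing import Dict

# wall_mins from the table, keyed by min_size and then by number of CPUs.
wall_mins: Dict[int, Dict[int, int]] = {
    1: {20: 676, 40: 444, 60: 379, 80: 353},
    100: {20: 462, 40: 356, 60: 339, 80: 337},
    1000: {20: 513, 40: 376, 60: 346, 80: 346},
}

def relative_speedup(times: Dict[int, int], base_cpus: int, cpus: int) -> float:
    """How many times faster the run on `cpus` is than the run on `base_cpus`."""
    return times[base_cpus] / times[cpus]

def relative_efficiency(times: Dict[int, int], base_cpus: int, cpus: int) -> float:
    """Speedup divided by the factor by which the processor count grew."""
    return relative_speedup(times, base_cpus, cpus) / (cpus / base_cpus)

for min_size, times in wall_mins.items():
    s = relative_speedup(times, 20, 80)
    e = relative_efficiency(times, 20, 80)
    print(f"min_size {min_size}: 20 -> 80 CPUs speedup {s:.2f}x, efficiency {e:.2f}")
```

For every min_size setting, quadrupling the processor count gives well under a fourfold speedup, so the gain from additional processors flattens out beyond about 60.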
Distributed Drug Discovery
• Project run by Dr. Bill Scott at IUPUI
• Tackling neglected diseases using distributed chemistry (while educating undergraduates about combinatorial chemistry)
• Each student makes 4 compounds on cheap equipment. Each class will typically make around 60 compounds. Many universities participating around the world
• Reaction transformations, virtual and made compounds stored in PostgreSQL database exposed as a web service
• This information can then be drawn into our workflows. For example, searches for similar compounds can be done on PubChem, Tumor Cell Line database, etc.
Distributed Drug Discovery
William L. Scott, Distributed Drug Discovery: A Distributed Drug Discovery Concept to Search for Developing World Disease Drug Leads
Visualization and end-user tools
• PubChemSR
• 2D structure visualizer using CDK
• VoPlot
• VisualiSAR - modal fingerprints
• Similarity Matrix Visualization
• General approaches to end user tools
– Portlets and .NET
– Usability & Contextual Design
PubChemSR (Junguk Hur)
http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR
Simple 2D viewer applet (using CDK) - David Jiao
VoPlot
VisualiSAR - modal fingerprints
With a nod to Edward Tufte. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices
Student: Christopher Mueller
[Figure: similarity matrix panels labeled Original (curated), Breadth-first Search, Degree, and Sloan’s Algorithm. Data: NCI Compound Database - Compounds with positive AIDS screens]
Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
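To show how a vertex ordering changes what a similarity matrix reveals, the sketch below builds the matrix of a small graph under three orderings: original labels, breadth-first search, and descending degree. The graph is invented, and the "bandwidth" measure (largest distance of a dot from the diagonal) is my own illustrative summary, not something taken from the slides. Sloan's algorithm is not implemented here.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]

def neighbor_lists(n: int, edges: Sequence[Edge]) -> Dict[int, List[int]]:
    nbrs: Dict[int, List[int]] = {v: [] for v in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs

def bfs_order(n: int, edges: Sequence[Edge], start: int = 0) -> List[int]:
    """Vertices in breadth-first order, restarting for disconnected parts."""
    nbrs = neighbor_lists(n, edges)
    order: List[int] = []
    seen = set()
    for root in [start] + [v for v in range(n) if v != start]:
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(nbrs[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order

def degree_order(n: int, edges: Sequence[Edge]) -> List[int]:
    """Vertices by descending degree, ties broken by index."""
    nbrs = neighbor_lists(n, edges)
    return sorted(range(n), key=lambda v: (-len(nbrs[v]), v))

def matrix(n: int, edges: Sequence[Edge], order: Sequence[int]) -> List[List[int]]:
    """Symmetric 0/1 matrix with rows and columns arranged by `order`."""
    pos = {v: k for k, v in enumerate(order)}
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[pos[u]][pos[v]] = m[pos[v]][pos[u]] = 1
    return m

def bandwidth(m: List[List[int]]) -> int:
    return max((abs(i - j) for i, row in enumerate(m) for j, x in enumerate(row) if x), default=0)

# A simple chain of five items whose labels are scrambled.
edges = [(0, 3), (3, 1), (1, 4), (4, 2)]
original = list(range(5))
bfs = bfs_order(5, edges)
for name, order in [("original", original), ("bfs", bfs), ("degree", degree_order(5, edges))]:
    print(name, order, "bandwidth", bandwidth(matrix(5, edges, order)))
```

In this toy case the breadth-first ordering pulls every dot next to the diagonal, which makes the chain structure visible at a glance.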
General approaches to end-user tools
• Main interface-level vehicle should be portlets, allowing reuse and interchangeability
• Other interfaces, such as .NET clients, email and RSS interfaces, will also be investigated
• No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
• Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
• Possibility of multiple interfaces for different groups of people (Cooper’s “primary personas”)
• Don’t assume the browser interface – email / NLP?
• Start with the basics
– 2D chemical structure drawing (input)
– Visualization of large numbers of chemical structures in 2D
– 3D chemical structure visualization
• Current project is looking at usability of online chemical databases (including PubChem)
Usability of 2D structure drawing tools
• Key difference between “sequential” and “random” drawers
• Huge difference in intuitiveness
• Key factor: how badly you can mess things up
• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw
Cambridge-Indiana Collaboration
• Weekly Access Grid meetings
• Bringing together areas of expertise in the UK and USA
• Applying OSCAR text mining to NIH data
• Looking toward joint presentations & publications
Contributors
• My students
– Xiao Dong
– Huijung Wang
– Jason Lee
– Junguk Hur
– David Jiao
– Usha Cheemakurthi
– Waiping Kam
• Geoffrey’s group at CGL
– Marlon Pierce
– Jake Kim
– Sima Patel
– Smitha Ajay
• Others
– Gary Wiggins
– Melanie Wu
– Dennis Gannon
– Beth Plale
– Rajarshi Guha
– Peter Murray-Rust
– Peter Corbett
– Dan Zaharevitz