Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk I
July 17 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org
http://www.chembiogrid.org
With apologies for my credentials: I have written a few papers on biology, chemistry and crystallography while at Cambridge, Caltech and Syracuse, mostly on applications of parallel computing.
Start-up and Organization
Local teams, successful prototypes and international collaboration set up in 3 major focus areas
• “Tool and Data” Cyberinfrastructure
• “Archival Database and Simulation” Cyberinfrastructure
• Education
Wiki chosen to support project as a shared editable web space
Web site http://www.chembiogrid.org
Building Collaboratory involving PubChem – Global Information System accessible anywhere and at any time – enhance PubChem with distributed tools (clustering, simulation, annotation etc.) and data
Initial results discussed at conferences/workshops/papers
• Gordon Conferences, ACS, SDSC tutorial
First new Cheminformatics courses offered
Advisory board set up and met
Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks
Good interactions with NIH DTP, Lilly and Michigan ECCR
http://www.chembiogrid.org
CICC Senior Personnel
Geoffrey C. Fox
Mu-Hyun (Mookie) Baik
Dennis B. Gannon
Marlon Pierce
Beth A. Plale
Gary D. Wiggins
David J. Wild
Yuqing (Melanie) Wu
Peter T. Cherbas
Mehmet M. Dalkilic
Charles H. Davis
A. Keith Dunker
Kelsey M. Forsythe
Kevin E. Gilbert
John C. Huffman
Malika Mahoui
Daniel J. Mindiola
Santiago D. Schnell
William Scott
Craig A. Stewart
David R. Williams
From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)
CICC Advisory Board
Alan D. Palkowitz (Eli Lilly)
Andrew Martin (Kalypsys)
David Spellmeyer (IBM)
Dimitris K. Agrafiotis (Johnson & Johnson)
Horst Hemmerle (Eli Lilly)
James M. Caruthers (Purdue University)
Jeremy G. Frey (University of Southampton)
Joel Saltz (Ohio State University/University of Maryland/Johns Hopkins University)
John M. Barnard (Digital Chemistry)
John Reynders (Eli Lilly)
Peter Murray-Rust (University of Cambridge)
Peter Willett (University of Sheffield)
Thompson Doman (Eli Lilly)
Val Gillet (University of Sheffield)
Industry and Academia
Met October 2005, will meet this fall
Publications
Baik says he is especially productive due to Cyberinfrastructure
Our Meetings are on the Web
Varuna environment for molecular modeling (Baik, IU)
[Diagram labels: Researcher; Chemical Concepts; Experiments; Papers etc.; Simulation Service (FORTRAN Code, Scripts); Condor “Flocks”; TeraGrid Supercomputers; DB Service (Queries, Clustering, Curation, etc.); QM Database; QM/MM Database; Reaction DB; ChemBioGrid; PubChem, PDB, NCI, etc.]
Cyberinfrastructure and Grids
These support eScience or distributed Computers, Databases, Instruments, Sensors and People
Grids use large scale managed Web services – the current major technology building on modern Industry enterprise and Internet systems
• W3C, OASIS, OGF or Open Grid Forum (Fox VP for eScience) develop standards ensuring distributed resources interoperate
Cheminformatics benefits from 2 styles of Grids
• TeraGrid typifies Grid support of large scale computation of parallel simulations
• Bioinformatics (BIRN, caBIG, MyGrid …), Earth Science and Astronomy Grids illustrate integration of real-time and archival data(bases) and computation
Well designed Grids run faster than older approaches
Cheminformatics Grids Need
Broad System standards such as WSDL, SOAP, WSRM, JSDL, BPEL
Domain specific data structures
• CML Cheminformatics
• GML Earth Science
• CellML, SBML Biology
• VOQL Astronomy
Use of specific Grid/Web service technologies such as
• Web services directly for tools
• Web service proxies for large simulation codes – ANYTHING can be made a Web service efficiently if execution/network access time ≥ 20ms
• Portals/Portlets for user interfaces
• Workflow for composition
Access to data and compute resources
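The 20 ms rule of thumb for wrapping code as a Web service can be read as an overhead budget. The following is a minimal sketch, assuming a fixed per-call wrapper cost; the slide gives only the 20 ms threshold, so the 2 ms overhead figure and both function names are hypothetical and for illustration only.

```python
def wrapper_efficiency(work_ms: float, overhead_ms: float) -> float:
    """Fraction of wall-clock time spent on useful work when a call that does
    work_ms of execution/network access also pays overhead_ms of wrapper cost."""
    if work_ms < 0 or overhead_ms < 0:
        raise ValueError("times must be non-negative")
    total = work_ms + overhead_ms
    return work_ms / total if total > 0 else 0.0


def worth_wrapping(work_ms: float, threshold_ms: float = 20.0) -> bool:
    """Apply the slide's rule of thumb: wrap when the work takes >= 20 ms."""
    return work_ms >= threshold_ms


if __name__ == "__main__":
    ASSUMED_OVERHEAD_MS = 2.0  # hypothetical per-call cost, not from the slides
    for work in (1.0, 20.0, 2000.0):
        eff = wrapper_efficiency(work, ASSUMED_OVERHEAD_MS)
        print(f"{work:8.1f} ms  wrap={worth_wrapping(work)}  efficiency={eff:.3f}")
```

Under this model the wrapper cost matters for very short calls and becomes negligible for long simulation runs, which is the point of putting proxies in front of large codes.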
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.
[Map labels: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, Wisc]
Top500 Supercomputers in the world
Indiana University has Highest Performance U.S. Academic Computer System
20 Teraflops peak
Products and Demonstrations
www.chembiogrid.org
Note mixture of In-house, Out of House, Commercial, Academic
CICC Prototype Web Services
Basic cheminformatics
• Molecular weights
• Molecular formulae
• Tanimoto similarity
• 2D Structure diagrams
• Molecular descriptors
• 3D structures
• InChi generation/search
• CMLRSS
Application based services
• Compare (NIH)
• Toxicity predictions (ToxTree)
• Literature extraction (OSCAR3)
• Clustering (BCI Toolkit)
• Docking, filtering, ... (OpenEye)
• Varuna simulation
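Of the basic services listed, Tanimoto similarity is compact enough to sketch. This is a minimal sketch of the standard coefficient |A ∩ B| / |A ∪ B|, assuming fingerprints are represented as sets of on-bit positions. It is not the CICC service code, the example fingerprints are made up, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import AbstractSet


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    # Two empty fingerprints share nothing; treat them as dissimilar (assumption).
    return shared / union if union else 0.0


if __name__ == "__main__":
    # Made-up fingerprints: 2 shared bits out of 5 distinct bits.
    a = {1, 4, 7, 9}
    b = {1, 4, 8}
    print(tanimoto(a, b))
```

A real service would first compute the fingerprints from structures with a toolkit such as the CDK; only the comparison step is shown.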
Next steps?
• Define WSDL interfaces to enable global production of compatible Web services; refine CML
• Ready to try “Prototype Production”
• Develop more training material
• Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes
• In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies
Key Ideas
• Add value to PubChem with additional distributed services and databases
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
• Provide access to key databases via a web service interface
• Provide access to major Compute Grids
Web Service Locations
Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services
Penn State University: CDK based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors
Cambridge University
• InChi generation / search
• CMLRSS
• OpenBabel
InfoChem
• SPRESI database
SDSC
• Typical TeraGrid Site
NIH
• PubChem …..
• Compare …..
Usage of Open Source Projects
A number of open source projects are used in our infrastructure
• CDK provides the underlying cheminformatics toolkit
• R provides the back-end modeling capabilities
• OSCAR is used for literature mining
• ToxTree is used to provide toxicity classification
• Open data and standards as promoted by the Blue Obelisk project
Contributions to Open Source Projects
We also contribute functionality to these projects
• Molecular descriptor development to the CDK
• Modifications to various CDK functionality to make it suitable for web service usage
• Infrastructure for accessing R from the CDK
• Packages to use the CDK from within R
• Quality control, testing and documentation
Steinbeck, C. et al.; Curr. Pharm. Des., 2006, 12(17), 2110-2120
Guha, R.; CDK News, 2005, 2(1), 7-13
Workflows Using Chemical Literature
All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red
[Workflow diagram labels: Bulk download of PubMed abstracts; Extract chemical structures; OSCAR3 program; OSCAR3 Service; Searchable (structure/similarity) Grid database; Find similar molecules; Local DTP database; PubChem; PDBBind; Find similar documents]
Example extracted records:
SMILES   NAME            PubMed ID
CCC      propane         1425356
CC       ethane          3546453
.....    .............   .............
Clustering of documents linked to clustering of chemicals
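The extracted records shown on this slide (SMILES, name, PubMed ID) lend themselves to a structure-keyed lookup. The following is a minimal sketch, assuming the extraction output is whitespace-separated text with the SMILES first and a numeric PubMed ID last. The real OSCAR3 output format is not given in the slides, and `Hit`, `parse_records` and `index_by_smiles` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Hit(NamedTuple):
    smiles: str
    name: str
    pubmed_id: int


def parse_records(text: str) -> List[Hit]:
    """Parse 'SMILES NAME ... PUBMED_ID' lines; skip headers and filler rows."""
    hits = []
    for line in text.splitlines():
        parts = line.split()
        # Need at least SMILES, one name token, and a numeric ID in last place.
        if len(parts) < 3 or not parts[-1].isdecimal():
            continue
        hits.append(Hit(parts[0], " ".join(parts[1:-1]), int(parts[-1])))
    return hits


def index_by_smiles(hits: List[Hit]) -> Dict[str, Set[int]]:
    """Map each structure to the set of PubMed IDs that mention it."""
    index = defaultdict(set)
    for hit in hits:
        index[hit.smiles].add(hit.pubmed_id)
    return dict(index)


# The two rows shown on the slide, plus its header and filler row.
SAMPLE = """SMILES NAME Pubmed ID
CCC propane 1425356
CC ethane 3546453
..... ............. .............
"""

if __name__ == "__main__":
    print(index_by_smiles(parse_records(SAMPLE)))
```

An exact-match index like this only supports identity lookup; the structure and similarity search described on the slide would need a chemistry-aware database on top.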
Document-enhanced Cyberinfrastructure
[Diagram labels: Existing User Interface; Existing Document-based Research Tools (Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.); Web service Wrappers; New Document-enhanced Research Tools including Web2.0, Mashups, Annotation; Integration/Enhancement User Interface; Community Tools (CiteULike, Connotea, Del.icio.us, Bibsonomy, Biolicious); Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; PubChem; PubMed; Traditional Cyberinfrastructure]
Products and Demonstrations II
David Wild – Research Overview July 2006. Page 21 Indiana University School of
Example HTS workflow: organization & flagging
• A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using DTP Tumor Cell Line database)
• The compounds are clustered on chemical structure similarity, to group similar compounds together
• The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT
• OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs
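The clustering and flagging steps of this workflow can be sketched in a few lines. This is a minimal sketch, not the production workflow: it swaps in a simple leader clustering over Tanimoto similarity where the slides use the BCI Toolkit, it omits the OpenEye FILTER and VOTABLES steps, and the compound IDs, fingerprints, activities and the 0.5 threshold are all made-up assumptions.

```python
from typing import AbstractSet, Dict, List


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def leader_cluster(fingerprints: Dict[str, AbstractSet[int]],
                   threshold: float) -> Dict[str, int]:
    """Assign each compound to the first cluster whose leader is at least
    `threshold` similar; otherwise the compound starts a new cluster."""
    leaders: List[AbstractSet[int]] = []
    assignment: Dict[str, int] = {}
    for compound_id, fp in fingerprints.items():
        for cluster_id, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                assignment[compound_id] = cluster_id
                break
        else:  # no existing leader was similar enough
            leaders.append(fp)
            assignment[compound_id] = len(leaders) - 1
    return assignment


def organize(activity: Dict[str, float],
             fingerprints: Dict[str, AbstractSet[int]],
             threshold: float = 0.5) -> List[dict]:
    """Join screen activity with cluster membership, one row per compound."""
    clusters = leader_cluster(fingerprints, threshold)
    return [{"id": cid, "activity": activity[cid], "cluster": clusters[cid]}
            for cid in fingerprints]


if __name__ == "__main__":
    fps = {"c1": {1, 2, 3, 4}, "c2": {1, 2, 3, 5}, "c3": {10, 11}}  # made up
    act = {"c1": 6.1, "c2": 5.8, "c3": 4.0}                        # made up
    for row in organize(act, fps):
        print(row)
```

Leader clustering depends on input order and on the threshold, so it is suited to a quick first grouping rather than a final analysis.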
Taverna Workflow
[Screenshot labels: Load Workflow; Run Workflow; Current Process; Result Output; Result Output URL]
Lilly very interested in our new educational programs
Total Grad Enrollment: Chem-, Lab, Bio-, Health Informatics, Fall 2005
Red = Expected, Chem, Fall 2006 (the second number in each Chem entry appears to be the expected value)
MS      Chem   Lab   Bio   Health
IUB     3/3    0     38    0
IUPUI   6/3    15    34    36
TOTAL   9/6    15    72    36
PhD     Chem   Lab   Bio   Health
IUB     1/3    0     3     0
IUPUI   0/1    0     4     3
TOTAL   1/4    0     7     3
Formal Cheminformatics Courses
• I571 Chemical Information Technology (3 cr.)
– Distance Ed section had 10 students in Fall 2005, from California to Connecticut
• I572 Computational Chemistry and Molecular Modeling (3 cr.)
• I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
• I553 Independent Study in Chemical Informatics (3 cr.)
• Above courses required for the new Graduate Certificate Program in Chemical Informatics
• Also I533 (Cheminformatics seminar)
More detailed Slides not used
TeraGrid Hardware and Software
TeraGrid is coordinated at the University of Chicago and includes 8 partner facilities
• NCSA, SDSC, PSC, ORNL, IU, PU, TACC, UC/ANL
TeraGrid hardware totals > 102 teraflops of computing power
• Comprehensive information available from http://www.teragrid.org/userinfo/hardware/overview.php
• Systems are primarily Linux clusters
Grid software and services (Globus, MyProxy, etc.) provide a uniform means for accessing TeraGrid resources
• Scheduling, running and monitoring jobs
• Monitoring resources
• Moving and managing remote files
• Common service APIs simplify the process for building remote tools
Prototype CICC Project: Controlling the TGF pathway
Collaboration between Baik & Zhang at IU
[Diagram labels: PDB; 1IAS; Inactive TGF; Active TGF With inhibitor; VARUNA; Experiments in the Zhang Lab; PubChem; in-house Molecules in Varuna; Simulations; AutoGeFF; Web Service to generate custom force fields; Conceptual Understanding of TGF Inhibition]
Questions:
- What molecular feature controls inhibitor binding?
- How do mutations impact binding?
MLSCN Data - How services and workflows are used
• MLSCN submits HTS data to PubChem and/or sends directly to workflow for real-time feedback
• Data is stored in PubChem
• Workflows perform different kinds of analysis on the MLSCN data - the variety of workflows is limitless
• End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis
• PubChem interfaces to workflows via SOAP
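The SOAP link between PubChem and the workflows can be illustrated with a request envelope built from the Python standard library. This is a minimal sketch: the slides do not give the actual interface, so the `runWorkflow` operation, the `assayId` parameter and the `urn:example:workflow` namespace are all hypothetical. Only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:workflow"  # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Build a SOAP 1.1 envelope whose Body holds one operation element
    with one child element per parameter."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical call: ask a workflow to analyse one assay's HTS data.
    print(build_request("runWorkflow", {"assayId": 123}).decode("utf-8"))
```

In practice the message shape would come from the service's WSDL and a SOAP client library would generate it; the sketch only shows what travels over the wire.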