Upload
dina-byrd
View
214
Download
0
Embed Size (px)
Citation preview
Genomes to Grids Thoughts on Building Data Grids for Biology
Biologists have discovered many millions of genes and genome features, now part of the bio-data "library" distributed on computers around the world. Grid computing methods for finding and using interesting genome knowledge from this mountain of data are discussed - their promise and practical concerns for building usable bioinformatics grids.
Don Gilbert, [email protected]
Bio Databanks, EBI, Sept. 2002
Databank Contents EntriesEMBL DNA Sequences 18,800,000SWALL Protein sequences 900,000InterPro+ Protein motifs 1,000,000HGBASE SNP database 1,500,000
Metabolic Pathways 250,000MEDLINE Literature 11,350,000Total 33,800,000
Many data objects, data sets updated frequently (daily) --> Keeping current data is a problem
Constellation of Bio-Data (SRS - Lion Bioscience)
Many databanks, variously structured,widely distributed, loosely federated - finding “best data” a problem
Genome database & info system components
Any genome database relies on, and feeds into, many other databases
BioGrid Schematic
• Grid-aware client software• Data and software directories
• Grid of processing computers
Moving Bio-Data on the Grid1. @virtualdata= biodirectory( "find protein coding sequences
for Drosophila and Anopheles species”)
2. @realdata= biodirectory( "get locators for @virtualdata split n ways”)
3. for i (1.. n) { copydata(realdata[i],gridcpu[i]); runapp(gridcpu[i]) }
Directories of Bio-Data• Directories are a necessary step for usable grids of bio-data
– "broad and shallow" directories federate the "narrow and deep" databases
• Bio-Data Access Tools
SRS, Sequence Retrieval System; Entrez ; AceDB; Genome relational databases (Ensembl, FlyBase, WormBase) ; IBM DiscoveryLink; BioDAS ; BioMoby
• Directory services for data access tools– Layer onto access tools for common query/retrieval of important data– LDAP: mature, efficient for high volumes, queries over distributed
directories ; works well with bio-access tools
– Web Services: XML messages over Web ; wide industry support , but standards are in progress
Bio-data Directory Needs• Build on existing technology for finding distributed objects• Efficient for millions of objects, by the gigabyte and terabyte• Queries distributed across directories of collaborating
services• Support existing and new bioinformatics data access
(relational dbs, object and XML dbs, SRS, Entrez, AceDB)• Simple client program methods for computable use of
directories• Flexible, common schema for describing objects• Replicate directories and objects among bioinformatics
centers• Peer-to-peer directories for collaborative projects• Strong authentication and security for data access
Directory technology: LDAP,Web Services and/or?
• LDAP– Object-centric, optimized for efficient read operations.
– Hierarchical, network-able, distributed and replicated in nature
– Has many features needed for bio-data access
• Web/XML – SOAP+: SOAP for directory requests, WSDL to interface the directory
repository, UDDI to locate the service (some assembly still required…)
– UDDI is potential match to LDAP as directory technology
– DSML: layer on top of LDAP for Web/XML interoperability
• Peer-to-peer (JXTA)? Grid SQL? XML-query systems? – Possible future directory technology
BioDirectory Tests• SRS bioinformatics data retrieval system, for efficient
retrieval of millions of bio-objects
• OpenLDAP for high performance and JavaLDAP for easy to configure directory transport.
• GLUE and Jakarta/Tomcat for Web Services tests. • DSML, Directory Services markup for XML/LDAP
conversion.• Test queries: 20,000 to 1.2 million biosequence objects from
GenBank, SwissProt and related dbs.
IUBio SRS Server + LDAP, WebServices --> Bio-object directory search/retrieval
Using Bio DirectoriesSimple client
software
Automated use
People use
Discovery
Search by many criteria
Retrieve bulk subsets
BioGrid Runner
A Globus CoG kit application for bioinformaticshttp://iubio.bio.indiana.edu/biogrid/runner/
Wrap up• Future of Bio-data on Grids
– Globus Toolkit useful for bio-grid data & compute intensive tools (BLAST, HMMer, Meme, others)
– High volume, complex, changing, distributed data– Add methods to find & move data among grid,
diretories of objects– LDAP works well ; Web-XML is usable, being defined
• Bio. Community needs and uses– Common data descriptions, schema, ontologies – Simple, practical, flexible grid methods ; use existing
dbs
See http://iubio.bio.indiana.edu/biogrid/