Genomes to Grids Thoughts on Building Data Grids for Biology Biologists have discovered many millions of genes and genome features, now part of the bio-data

Genomes to Grids Thoughts on Building Data Grids for Biology

Biologists have discovered many millions of genes and genome features, now part of the bio-data "library" distributed on computers around the world. Grid computing methods for finding and using interesting genome knowledge from this mountain of data are discussed - their promise and practical concerns for building usable bioinformatics grids.

Don Gilbert, [email protected]

Bio Databanks, EBI, Sept. 2002

Databank Contents EntriesEMBL DNA Sequences 18,800,000SWALL Protein sequences 900,000InterPro+ Protein motifs 1,000,000HGBASE SNP database 1,500,000

Metabolic Pathways 250,000MEDLINE Literature 11,350,000Total 33,800,000

Many data objects, data sets updated frequently (daily) --> Keeping current data is a problem

Constellation of Bio-Data (SRS - Lion Bioscience)

Many databanks, variously structured,widely distributed, loosely federated - finding “best data” a problem

Genome database & info system components

Any genome database relies on, and feeds into, many other databases

BioGrid Schematic

• Grid-aware client software• Data and software directories

• Grid of processing computers

Moving Bio-Data on the Grid1. @virtualdata= biodirectory( "find protein coding sequences

for Drosophila and Anopheles species”)

2. @realdata= biodirectory( "get locators for @virtualdata split n ways”)

3. for i (1.. n) { copydata(realdata[i],gridcpu[i]); runapp(gridcpu[i]) }

Directories of Bio-Data• Directories are a necessary step for usable grids of bio-data

– "broad and shallow" directories federate the "narrow and deep" databases

• Bio-Data Access Tools

SRS, Sequence Retrieval System; Entrez ; AceDB; Genome relational databases (Ensembl, FlyBase, WormBase) ; IBM DiscoveryLink; BioDAS ; BioMoby

• Directory services for data access tools– Layer onto access tools for common query/retrieval of important data– LDAP: mature, efficient for high volumes, queries over distributed

directories ; works well with bio-access tools

– Web Services: XML messages over Web ; wide industry support , but standards are in progress

Bio-data Directory Needs• Build on existing technology for finding distributed objects• Efficient for millions of objects, by the gigabyte and terabyte• Queries distributed across directories of collaborating

services• Support existing and new bioinformatics data access

(relational dbs, object and XML dbs, SRS, Entrez, AceDB)• Simple client program methods for computable use of

directories• Flexible, common schema for describing objects• Replicate directories and objects among bioinformatics

centers• Peer-to-peer directories for collaborative projects• Strong authentication and security for data access

Directory technology: LDAP,Web Services and/or?

• LDAP– Object-centric, optimized for efficient read operations.

– Hierarchical, network-able, distributed and replicated in nature

– Has many features needed for bio-data access

• Web/XML – SOAP+: SOAP for directory requests, WSDL to interface the directory

repository, UDDI to locate the service (some assembly still required…)

– UDDI is potential match to LDAP as directory technology

– DSML: layer on top of LDAP for Web/XML interoperability

• Peer-to-peer (JXTA)? Grid SQL? XML-query systems? – Possible future directory technology

BioDirectory Tests• SRS bioinformatics data retrieval system, for efficient

retrieval of millions of bio-objects

• OpenLDAP for high performance and JavaLDAP for easy to configure directory transport.

• GLUE and Jakarta/Tomcat for Web Services tests. • DSML, Directory Services markup for XML/LDAP

conversion.• Test queries: 20,000 to 1.2 million biosequence objects from

GenBank, SwissProt and related dbs.

IUBio SRS Server + LDAP, WebServices --> Bio-object directory search/retrieval

Using Bio DirectoriesSimple client

software

Automated use

People use

Discovery

Search by many criteria

Retrieve bulk subsets

BioGrid Runner

A Globus CoG kit application for bioinformaticshttp://iubio.bio.indiana.edu/biogrid/runner/

Wrap up• Future of Bio-data on Grids

– Globus Toolkit useful for bio-grid data & compute intensive tools (BLAST, HMMer, Meme, others)

– High volume, complex, changing, distributed data– Add methods to find & move data among grid,

diretories of objects– LDAP works well ; Web-XML is usable, being defined

• Bio. Community needs and uses– Common data descriptions, schema, ontologies – Simple, practical, flexible grid methods ; use existing

dbs

See http://iubio.bio.indiana.edu/biogrid/

Documents

Genomes to Grids Thoughts on Building Data Grids for Biology Biologists have discovered many millions of genes and genome features, now part of the bio-data