BODHI 1 BODHI, A Bio-diversity Database Pla(n)tform Jayant Haritsa Database Systems Lab Supercomputer Education and Research Centre Indian Institute of Science


BODHI 1

BODHI, A Bio-diversity Database Pla(n)tform

Jayant Haritsa

Database Systems Lab

Supercomputer Education and Research Centre

Indian Institute of Science

BODHI 2

Team

B. J. Srikanta (next talk)

Prof. Madhav Gadgil, Prof. V. Nanjundiah (Centre for Ecological Sciences, IISc)

Several Masters Students

Funded by DBT

BODHI 3

Motivation

GATT – Patent Laws
  To be in place by 2005

Loss
  Neem
  Basmati (estimated export value: Rs. 1,198 crore)
  Turmeric

Global and local efforts
  GBIF (Global Biodiversity Information Facility)
  Karnataka Bio-diversity Board [Deccan Herald - Aug 26 2000]

BODHI 4

Bio-diversity Data

Taxonomy of species
  Phenetic (physical) characteristics
  Phylogenetic (evolutionary) characteristics

Habitat / Spatial distribution
  Political Layout
  Geographic Layout
  Biospheres

Genetic information
  Bio-molecular sequences
  Structural information

BODHI 5

MULTI-DOMAIN QUERY

Retrieve all plant species that share a common habitat, have identical Inflorescence characteristics, and have a DNA sequence within BLAST score of 80, with respect to “Michelia-champa”.
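To make the three parts of this query concrete, here is a minimal in-memory sketch in Python. It is not BODHI's query language or API: the `Species` record, the `similarity_score` function (a toy stand-in for a BLAST score), the reading of "within BLAST score of 80" as "score of at least 80", and all sample data are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    """Hypothetical record combining the three data domains."""
    name: str
    habitats: frozenset      # spatial domain: names of regions where it occurs
    inflorescence: str       # taxonomy domain: one phenetic characteristic
    dna: str                 # genome domain: a DNA sequence


def similarity_score(a: str, b: str) -> int:
    """Toy stand-in for a BLAST score: percentage of matching positions."""
    if not a or not b:
        return 0
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches // max(len(a), len(b))


def multi_domain_query(db, target_name, min_score=80):
    """Names of species that satisfy all three predicates against the target."""
    target = next(s for s in db if s.name == target_name)
    return [
        s.name
        for s in db
        if s.name != target.name
        and s.habitats & target.habitats                      # shared habitat
        and s.inflorescence == target.inflorescence           # same characteristic
        and similarity_score(s.dna, target.dna) >= min_score  # similar sequence
    ]


# Made-up sample data; only the target name comes from the slide.
SAMPLE_DB = [
    Species("Michelia-champa", frozenset({"Western Ghats"}), "solitary", "ACGTACGTAC"),
    Species("species-A", frozenset({"Western Ghats", "Deccan"}), "solitary", "ACGTACGTAA"),
    Species("species-B", frozenset({"Deccan"}), "solitary", "ACGTACGTAC"),
    Species("species-C", frozenset({"Western Ghats"}), "raceme", "ACGTACGTAC"),
    Species("species-D", frozenset({"Western Ghats"}), "solitary", "TTTTACGTAC"),
]

print(multi_domain_query(SAMPLE_DB, "Michelia-champa"))
```

Each predicate touches a different kind of data, which is why a single system that handles all three is attractive: in this sketch the filter is a linear scan, whereas a real system would want an index behind each predicate.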

BODHI 6

Difficulties:

Complex range of data types
  sets, hierarchies, aggregations, sequences, geometries, maps, audio, images …

Multidimensional data
  spatial (latitude, longitude, elevation) to proteins (hundreds of coordinates)

Computationally-intensive operators
  species relationships, spatial distributions, sequence alignments, ...
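As one illustration of why such operators are expensive, the sketch below computes a Smith-Waterman local alignment score, the exact dynamic-programming method that tools like BLAST and FASTA approximate heuristically. The scoring parameters (`match`, `mismatch`, `gap`) are arbitrary choices for this example, and nothing here is taken from BODHI's code.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))
```

Comparing one query sequence against every sequence in a database repeats this quadratic computation once per stored sequence, which is the cost that heuristics and indexes try to avoid.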

BODHI 7

Current Solutions

Small-Scale
  MS-Access / FoxPro / Excel / ...
  Pentium PCs

Large-Scale
  RDBMS: Oracle / DB2 / Informix / Sybase / …
  Unix servers: Sun / SGI / IBM / HP / ...

BODHI 8

Limitations:

RDBMS approach of “the world is a flat collection of tables with simple attributes” suits financial applications, NOT scientific (biological) applications

In particular, taxonomic / spatial / sequence / multimedia data modeling and processing are very cumbersome and coarse

BODHI 9

Limitations (contd)

Spatial and other applications are not within the database kernel but are connected externally.

E.g. many GIS systems have ArcInfo and MS-Access hooked up in a “black-box” manner, or BLAST/FASTA is run on sequence files generated from Oracle.

Problem: Slow and ugly!

BODHI 10

Is there Hope?

Object-Oriented DBMS
  “Natural” for biological applications

High-performance data access methods
  Path Dictionary Index, Multi-key Type Index, Pyramid Tree, ...

High-performance specialized operators
  spatial join, data mining, sequence processing, …

XML = HTML + Semantics
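The claim that an object model is “natural” for biological data can be illustrated with a small taxonomy hierarchy, using the Species, Genera, Family, Order ranks named on the implementation slide. The `Taxon` class below is a hypothetical sketch, not BODHI's schema, and the trait values are illustrative only.

```python
class Taxon:
    """A node in a taxonomic hierarchy (hypothetical sketch)."""

    def __init__(self, name, rank, parent=None, traits=None):
        self.name = name
        self.rank = rank                  # "Order", "Family", "Genus" or "Species"
        self.parent = parent              # link to the enclosing taxon
        self.traits = dict(traits or {})  # characteristics defined at this rank
        self.children = []                # aggregation: taxa contained in this one
        if parent is not None:
            parent.children.append(self)

    def trait(self, key):
        """Look up a characteristic, inheriting from ancestors if not set here."""
        node = self
        while node is not None:
            if key in node.traits:
                return node.traits[key]
            node = node.parent
        return None

    def lineage(self):
        """Names from the root of the hierarchy down to this taxon."""
        path, node = [], self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]


order = Taxon("Magnoliales", "Order")
family = Taxon("Magnoliaceae", "Family", order, {"habit": "woody"})
genus = Taxon("Michelia", "Genus", family)
species = Taxon("Michelia-champa", "Species", genus, {"inflorescence": "solitary"})

print(species.lineage())
print(species.trait("habit"))
```

In a flat relational schema the same lookup needs a self-join per level of the hierarchy; with objects, the parent link and the inherited lookup are part of the model.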

BODHI 11

Goals of BODHI

Seamless integration of taxonomic, spatial and genomic data using OO technology

Latest access methods and operators for all three types of data

Utilize XML for data exchange

Low-cost (ideally, free!)
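As a rough idea of what XML-based data exchange could look like, the sketch below serializes a species record to XML and parses it back using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`species`, `inflorescence`, `habitats`, `region`, `dna`) are invented for this example; the slides do not specify BODHI's exchange schema.

```python
import xml.etree.ElementTree as ET


def species_to_xml(record: dict) -> str:
    """Serialize a species record to an XML string (hypothetical schema)."""
    root = ET.Element("species", {"name": record["name"]})
    ET.SubElement(root, "inflorescence").text = record["inflorescence"]
    habitats = ET.SubElement(root, "habitats")
    for region in record["habitats"]:
        ET.SubElement(habitats, "region").text = region
    ET.SubElement(root, "dna").text = record["dna"]
    return ET.tostring(root, encoding="unicode")


def species_from_xml(text: str) -> dict:
    """Parse the XML produced by species_to_xml back into a record."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "inflorescence": root.findtext("inflorescence"),
        "habitats": [region.text for region in root.find("habitats")],
        "dna": root.findtext("dna"),
    }


# Made-up sample record.
record = {
    "name": "Michelia-champa",
    "inflorescence": "solitary",
    "habitats": ["Western Ghats", "Deccan"],
    "dna": "ACGTACGTAC",
}
xml_text = species_to_xml(record)
print(xml_text)
```

Because the tags name what each value means, a receiving system can pick out the habitat list or the sequence without knowing how the sender stores them.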

BODHI 12

Architecture of BODHI

[Figure: architecture diagram. Components shown, grouped by domain:]

Access: The Internet, Client Interface Framework, Query Processor

Taxonomy: Taxonomy Model, Object Operations, Object Indexes, Object Services

Spatial: Spatial Model, Spatial Operations, Spatial Indexes, Spatial Services

Genome: Genome Model, Genome Operations, Genome Indexes, Sequence Services

Storage: OBJECT STORAGE MANAGER

BODHI 13

Implementation of BODHI

[Figure: the architecture diagram annotated with the implementation of each component, grouped by domain:]

Access: The Internet, Client Interface Framework, λ-DB

Taxonomy: Species, Genera, Family, Order; Inheritance, Aggregation; Multi-Key Type, Path-Dictionary

Spatial: Country, State, City, River, Road; Overlaps, Contains, Closest, Within; R*-tree, Hilbert-Rtree

Genome: DNA, Protein; Alignment, BLAST, FASTA; ??? Indexes (next talk)

Services: Spatial Services, Object Services, Sequence Services

Basic Types (Point, Line, Polygon, Sets, Sequences, ...)

Storage: SHORE MICRO-KERNEL
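The spatial operators named on this slide (Overlaps, Contains, Closest, Within) can be sketched on axis-aligned rectangles, which is also the shape that R-tree style indexes store as bounding boxes. The `Rect` type and the function definitions below are assumptions for illustration; BODHI's actual operators work on richer geometries such as polygons and are not shown in the slides.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding rectangle (hypothetical simplification)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def contains(a: Rect, b: Rect) -> bool:
    """True if b lies entirely inside a."""
    return (a.xmin <= b.xmin and b.xmax <= a.xmax
            and a.ymin <= b.ymin and b.ymax <= a.ymax)


def within(a: Rect, b: Rect) -> bool:
    """True if a lies entirely inside b (the inverse of contains)."""
    return contains(b, a)


def distance(a: Rect, b: Rect) -> float:
    """Smallest distance between the rectangles; 0.0 if they overlap."""
    dx = max(b.xmin - a.xmax, a.xmin - b.xmax, 0)
    dy = max(b.ymin - a.ymax, a.ymin - b.ymax, 0)
    return math.hypot(dx, dy)


def closest(query: Rect, candidates):
    """The candidate nearest to the query rectangle."""
    return min(candidates, key=lambda r: distance(query, r))


# Made-up coordinates.
state = Rect(0, 0, 10, 10)
city = Rect(2, 2, 3, 3)
river = Rect(8, 5, 15, 6)
far_city = Rect(20, 20, 21, 21)

print(contains(state, city), overlaps(state, river), closest(city, [river, far_city]))
```

A spatial index answers these predicates without testing every object: it uses the same rectangle tests on index nodes to discard whole subtrees that cannot match.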

BODHI 15

Query Flow

BODHI 16

Project Status

Prototype (minus Client Interface Framework) has been operational since last month!

Platform: PIII-700MHz running Redhat Linux.

For Code, contact “[email protected]

BODHI 17

Performance Evaluation

SEQUOIA 2000 spatial benchmark: Competitive with Paradise GIS from Wisconsin

Taxonomy + Spatial Queries: Reasonably fast

But Genomics slows things down a lot due to absence of indexes (next talk)

BODHI 18

More details

“Design and Implementation of a Biodiversity Information System”, Proc. of Intl. Conf. on Management of Data (COMAD), Pune, December 2000

“The Building of BODHI, A Bio-diversity Database System”, TechRep-2001-02, DSL/SERC, IISc

Available at http://dsl.serc.iisc.ernet.in

BODHI 19

End of Talk