BioPostgres Overview, BOSC 2006


Page 1: BioPostgres

Stott Parker & Ruey-Lung Hsiao
UCLA Computer Science Dept.
UCLA Center for Computational Biology (CCB)

www.biopostgres.org

Page 2: Evolution of science

Observational science: direct gathering of data
Analytical science: predictive modeling
Computational science: simulation
Data exploration science: automation

Increasing emphasis on:
• scale
• networking
• informatics

The Future of Science

Page 3: bScience ? eScience

• large-scale, data-centric, computationally mind-numbing science
  http://research.microsoft.com/workshops/escience2005/
• the future of science (for eScientists): enormous centralized data centers that provide many information services

bScience
• NCBI is a good model for an eScience data center

The importance of extensible database systems
• Jim Gray stresses the importance of extensible databases in eScience: future scientists will spend their days writing large-scale SQL queries (!)
• Even if Jim Gray is not right about this, at some point the scale of biological information requires using database systems for some kind of data management. Think about terabytes… and exabytes…

Page 4: Aligning protein features & exon structure

    CREATE TABLE exon_feature AS
    SELECT DISTINCT
        e.gene,
        e.exon_no as exon,
        f.protein_id as protein,
        f.interval,
        f.type,
        f.description
    FROM
        gene g,
        exon e,
        mrna m,
        protein p,
        protein_feature f
    WHERE
        e.gene = g.gene
        AND e.gene = m.gene
        AND e.gene = p.gene
        AND p.protein_id = f.protein_id
        AND loc_range(e.coding_mrna) + '-2..2'::range
              - range_lower(loc_range(m.coding_mrna))
            @  -- contains
            (f.interval - 1) * 3;

    exon_feature
      gene   | exon | protein | interval |     type      |     description
    ---------+------+---------+----------+---------------+----------------------
     H2-DMb2 |    2 | P35737  | 75..75   | glycosylation | N-linked (GlcNAc...)
     H2-DMb2 |    3 | P35737  | 1..18    | signal        | potential
     H2-DMb2 |    4 | P35737  | 19..112  | domain        | lumenal beta-1
     H2-DMb2 |    4 | P35737  | 75..75   | glycosylation | N-linked (GlcNAc...)
     H2-DMb2 |    5 | P35737  | 113..207 | domain        | lumenal beta-2
     H2-DMb2 |    5 | P35737  | 114..204 | domain        | Ig-like
     H2-DMb2 |    6 | P35737  | 208..218 | domain        | connecting peptide
     H2-DMb2 |    6 | P35737  | 219..239 | transmembrane | potential
     H2-DMb2 |    7 | P35737  | 248..251 | site          | YXXZ motif
     HLA-DMB |    1 | P28068  | 1..18    | signal        | potential
     HLA-DMB |    2 | P28068  | 19..112  | domain        | lumenal beta-1
     HLA-DMB |    2 | P28068  | 110..110 | glycosylation | N-linked (GlcNAc...)
     HLA-DMB |    3 | P28068  | 113..207 | domain        | lumenal beta-2
     HLA-DMB |    3 | P28068  | 114..208 | domain        | Ig-like
     HLA-DMB |    4 | P28068  | 208..218 | domain        | connecting peptide
     HLA-DMB |    4 | P28068  | 219..239 | transmembrane | potential
     HLA-DMB |    5 | P28068  | 248..251 | site          | YXXZ motif

Databases take work to set up, but permit exploration -- asking and quickly getting answers to questions.

The join is a large-scale information connection operator that is both very basic and very annoying to program.
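The remark about joins can be made concrete outside PostgreSQL. The sketch below is illustrative only: the two toy tables and their rows are hypothetical (not the BioPostgres schema), and SQLite stands in for a DBMS through Python's standard sqlite3 module. It writes the same gene-keyed connection once as a hand-coded nested loop and once as a declarative SQL join.

```python
import sqlite3

# Hypothetical toy rows for illustration; not taken from any BioPostgres table.
exons = [("HLA-DMB", 1), ("HLA-DMB", 2), ("H2-DMb2", 2)]
proteins = [("HLA-DMB", "P28068"), ("H2-DMb2", "P35737")]


def join_by_hand(exons, proteins):
    """Nested-loop join on the gene name, spelled out explicitly."""
    out = []
    for gene_e, exon_no in exons:
        for gene_p, protein_id in proteins:
            if gene_e == gene_p:
                out.append((gene_e, exon_no, protein_id))
    return sorted(out)


def join_in_sql(exons, proteins):
    """The same join, stated declaratively and left to the database engine."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE exon (gene TEXT, exon_no INTEGER)")
    con.execute("CREATE TABLE protein (gene TEXT, protein_id TEXT)")
    con.executemany("INSERT INTO exon VALUES (?, ?)", exons)
    con.executemany("INSERT INTO protein VALUES (?, ?)", proteins)
    rows = con.execute(
        "SELECT e.gene, e.exon_no, p.protein_id "
        "FROM exon e, protein p "
        "WHERE e.gene = p.gene "
        "ORDER BY e.gene, e.exon_no, p.protein_id"
    ).fetchall()
    con.close()
    return rows
```

Both functions return the same rows; the hand-written version grows a loop level for every additional table, while the SQL version only grows a line in the FROM and WHERE clauses.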

Page 5: But: Biologists don't like databases

The main programming model is Perl, not SQL

Databases have negative associations
• Inflexibility
• Quirkiness
• Possible slowness
• Possible expense!
• Operations often must be done outside the database; SQL is not usually enough
• Steep learning curve

Page 6: Why do DBMS get an F in Biology?

1. There aren't that many choices for DBMS
   • … and for open-source DBMS there are few

2. DBMS were designed for business, not science
   • hard-to-change database schemas
   • limited set of data types
   • peculiarities of the SQL query language
   • quirks of query optimization
   • arcane programming models

3. Strong challenges are inherent in bioscience
   • very large scale
   • diverse types of information
   • extremely complex analytical queries

Page 7: Amazingly Few Large-Scale DBMS Options

Commercial DBMS
• IBM DB2/DiscoveryLink
• Microsoft SQL Server
• Oracle 10g

Open-Source DBMS
• MySQL
• PostgreSQL

Even though these have been designed with scalability as a key design goal, they have scalability limits that bScience will push:

     Maximum:      | MySQL 4.1 | PostgreSQL 8.1 |   Oracle 10g
    ---------------+-----------+----------------+----------------
     Database size | Unlimited | Unlimited      | 8 XB (= 8M TB)
     Table size    | 64 TB     | 32 TB          |
     Row size      | 8 KB      | 1.6 TB         |
     Field size    | 8 KB      | 1 GB           | 4 GB
     Columns/Table | 1000      | 250-1600       | 1000

Page 8: PostgreSQL? Why not just MySQL?

MySQL and PostgreSQL are the primary scalable open-source DBMS

Objective, feature-by-feature comparisons:
• http://troels.arvin.dk/db/rdbms/
• http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems

More subjective, focal-issue comparison of MySQL 4.1 and PostgreSQL 8.1, by criterion:
• Speed
• Stds Compliance
• Extensibility
• Installed Base

[Table: each criterion is rated with "+" marks per system on the slide; the split of the marks between the two systems is not recoverable from the extracted text]

Page 9: Aligning protein features & exon structure

    CREATE TABLE exon_feature AS
    SELECT DISTINCT
        e.gene,
        e.exon_no as exon,
        f.protein_id as protein,
        f.interval,
        f.type,
        f.description
    FROM
        gene g,
        exon e,
        mrna m,
        protein p,
        protein_feature f
    WHERE
        e.gene = g.gene
        AND e.gene = m.gene
        AND e.gene = p.gene
        AND p.protein_id = f.protein_id
        AND loc_range(e.coding_mrna) + '-2..2'::range
              - range_lower(loc_range(m.coding_mrna))
            @  -- contains
            (f.interval - 1) * 3;

[Result table exon_feature: the same 17 rows for H2-DMb2 (P35737) and HLA-DMB (P28068) as on Page 4]

What is really needed here is a new datatype for sequence locations. This query is painful to express without this, and not painful with it.

Page 10: BioPostgres -- some modular database infrastructure for Computational Biology

BioPostgres = PostgreSQL + Extensions
• PostgreSQL: an open-source, industrial-strength, scalable DBMS
  http://www.postgresql.org
• Extension: a new SQL API, with query operators and tools

Each Extension is a separate package
• Biosequence extensions
• GeneOntology extensions
• SQL datatype extensions
• System management extensions

Working toward complementing other BOSC platforms: BioPerl, BioPython, BioJava, BioSQL, BioConductor, GMOD, …

Page 11: Extensibility features of PostgreSQL

PostgreSQL was designed specifically to be extensible

Individual databases can be extended with:
• New datatypes
• New functions
• New query operators
• New indexing methods
• New query languages

These can be added or dropped anytime, on the fly
• dynamic linking of implementation libraries as needed

Flexible conventions for user-contributed modules
• implementations are typically in C (like PostgreSQL).

Page 12: Extending PostgreSQL

Within a given database, one can add new datatypes
A datatype can be added to all databases also

[Figure: PostgreSQL holding the database MyBioStuff (a PostgreSQL database), extended with a Graph datatype & query operators and a Seq Location datatype & query operators]

Page 13: PostgreSQL user-contributed modules

A new module (say "PostFoo") typically contains
• New functions (in PostFoo.c)
• SQL interface bindings for these functions (in PostFoo.sql)

The PostFoo module gets downloaded by others into their copy of the PostgreSQL source tree as a new directory

    postgresql-8.*.*/contrib/PostFoo/

In this directory, the source tree owner types

    gmake
    gmake install    # as root

This compiles only the module (NOT PostgreSQL)!

Afterwards anyone can dynamically add the module to any given PostgreSQL database (say "MyBioStuff") with a command like

    psql -d MyBioStuff < PostFoo.sql

See www.biopostgres.org/install.html

Page 14: BioPostgres Modules

BioPostgres is a collection of modules that extend PostgreSQL for Computational Biology. These modules are basically independent, and can be used separately or in conjunction with others.

    BLASTgres   Biosequence data analysis
    GObase      GeneOntology (GO) analysis
    PostGraph   Graph database extensions
    PostModel   Model base/data mining extensions
    PostMake    Derivation dependency extensions

Page 15: Quick overview: BLASTgres

BLASTgres -- extensions for biosequence management

Sequence location datatype: (seq_id, [start,end])
Sequence location operators: loc intersection, etc.
Sequence location query: find overlapping locs, etc.

Access to BLAST services: remote and local servers

URL: http://www.biopostgres.org/BLASTgres/

Page 16: Two sets of BLASTgres extensions

1. BLASTgres provides "BLAST query", "BLAST hit database"

    SELECT * FROM blast_sequence(…);

2. BLASTgres provides biosequence-related datatypes, with accompanying query operators and indexing methods:

   Sequence range (and: array of range)
       '17679235..17679427'

   Sequence location (and: array of location)
       'NT_011109.15[17679235..17679427]'

   Hit (= high-scoring sequence alignment information)
       ('in1[353..966]', 'HUMAPOE4[3779..4402]' 99.84, 624, 1, 0, 0, 1229)
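To make the two text formats above concrete, here is a minimal Python sketch that parses a range literal such as '17679235..17679427' and a location literal such as 'NT_011109.15[17679235..17679427]'. It only mirrors the notation shown on the slide; the names `Range`, `Loc`, `parse_range` and `parse_loc` are hypothetical helpers, not part of BLASTgres, and the real datatype input routines may accept more (or fewer) forms.

```python
import re
from typing import NamedTuple


class Range(NamedTuple):
    lower: int
    upper: int


class Loc(NamedTuple):
    seq_id: str
    span: Range


# 'lower..upper' with optional minus signs, e.g. '-2..2'
_RANGE_RE = re.compile(r"^(-?\d+)\.\.(-?\d+)$")
# 'seq_id[lower..upper]'
_LOC_RE = re.compile(r"^(?P<seq>[^\[\]]+)\[(?P<rng>[^\[\]]+)\]$")


def parse_range(text: str) -> Range:
    """Parse 'lower..upper' into a Range of two integers."""
    m = _RANGE_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a range literal: {text!r}")
    return Range(int(m.group(1)), int(m.group(2)))


def parse_loc(text: str) -> Loc:
    """Parse 'seq_id[lower..upper]' into a Loc (sequence id plus Range)."""
    m = _LOC_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a location literal: {text!r}")
    return Loc(m.group("seq"), parse_range(m.group("rng")))
```

Keeping the sequence identifier separate from the numeric range is what lets a location comparison first check that two locations lie on the same sequence before comparing coordinates.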

Page 17: BLAST access via BLASTgres queries

1. Simple BLAST queries

    SELECT * FROM local_blast_hit('atcgatcgatcg', 'lab-sequences', 'blastn');
    SELECT * FROM remote_blast_hsp('lab_protein.fasta', 'nr', 'blastp');
    SELECT * FROM remote_blast_hit('NM_010387.2[245..546]', 'nr', 'blastn');
    SELECT * FROM fasta_sequence('/lab/ests', 'P082345', 20, 40);

2. Annotations to BLAST query results

    -- automatic annotation to BLAST query results
    SELECT count(*), species
      FROM annotated_remote_blast_hit('AF101044', 'nr', 'blastn')
     GROUP BY species;

    SELECT *
      FROM annotated_remote_blast_hit('AF101044', 'nr')
     WHERE description LIKE '%SNRPN%'
       AND species <> 'Homo sapiens';

3. Advanced filtering for BLAST results in SQL

    SELECT subject_location, length
      FROM local_blast_hit('AF101044')
     WHERE evalue < 1E-5 AND bitscore > 800 AND mismatches < 10 AND identity > 40;

    SELECT *
      FROM local_blast_hit('actgactgactgactg', 'ESTs', 'blastn') A, features B
     WHERE A.subject_location && B.feature_location;  -- && means "overlaps"

4. Large-scale BLAST query

    -- BLAST multiple sequences at the same time.
    SELECT * FROM local_blast_hit_all('sequences', 'seq', 'ESTs', 'blastn');

    -- sensitivity test (comparison of BLAST query results using different parameters):
    CREATE TABLE parameter1 (name TEXT, value TEXT);
    INSERT INTO parameter1 VALUES ('WORD_SIZE', '7');  -- change the default settings
    INSERT INTO parameter1 VALUES ('GAPCOSTS', '5 2');
    CREATE TABLE parameter2 (name TEXT, value TEXT);
    INSERT INTO parameter2 VALUES ('WORD_SIZE', '13');
    INSERT INTO parameter2 VALUES ('GAPCOSTS', '3 1');

    -- retrieve matches that are not found in both results
    SELECT R1.*, R2.*
      FROM remote_blast_hit_v('AF101044', 'parameter1') R1,
           remote_blast_hit_v('AF101044', 'parameter2') R2
     WHERE (R1.subject_location && R2.subject_location)
       AND (R1.length <> R2.length);

Page 18: BLASTgres functions

Range Operators
range + range, range + int8, int8 + range, range - range, range - int8, range * range, range * int4, int4 * range, range | range, range < range, range <= range, range > range, range >= range, range << range, range &< range, range && range, range &> range, range >> range, range = range, range <> range, range @ range, range ~ range, range @< range, range @> range

Range Aggregate
minmax(range)

Range Functions
range_over_left(range, range)  range_over_right(range, range)  range_left(range, range)  range_right(range, range)
range_lt(range, range)  range_le(range, range)  range_gt(range, range)  range_ge(range, range)  range_ne(range, range)
range_inside(range, range)  range_contains(range, range)  range_contained(range, range)  range_overlaps(range, range)
range_eq(range, range)  range_meets(range, range)  range_met_by(range, range)  range_starts(range, range)
range_started_by(range, range)  range_finishes(range, range)  range_finished_by(range, range)
range_same_lower(range, range)  range_same_upper(range, range)  range_minus(range, range)  range_plus(range, range)
range_torange(int8, int8)  range_maxmin(range, range)  range_minmax(range, range)  range_extend(range, int4)
range_cmp(range, range)  range_union(range, range)  range_inter(range, range)  range_size(range)
range_upper(range)  range_lower(range)  range_times(range, range)  range_times(range, int4)
range_times(int4, range)  range_plus(range, int8)  range_plus(int8, range)  range_minus(range, int8)

Location Operators
loc + loc, loc + range, loc + int8, loc - loc, loc - range, loc - int8, loc * loc, loc * range, loc * int4, loc = loc, loc <> loc, loc <> range, loc < loc, loc < range, loc <= loc, loc <= range, loc > loc, loc > range, loc >= loc, loc >= range, loc << loc, loc << range, loc &< loc, loc &< range, loc && loc, loc && range, loc &> loc, loc &> range, loc >> loc, loc >> range, loc @ loc, loc @ range, loc ~ loc, loc ~ range

Location Functions
loc_range(loc)  loc_seqid(loc)  loc_size(loc)  loc_lower(loc)  loc_upper(loc)
loc_positive_strand(loc)  loc_negative_strand(loc)  loc_same_strand(loc, loc)  loc_negate(loc)
loc_eq(loc, loc)  loc_eq(loc, range)  loc_ne(loc, loc)  loc_ne(loc, range)
loc_over_left(loc, loc)  loc_over_left(loc, range)  loc_over_right(loc, loc)  loc_over_right(loc, range)
loc_left(loc, loc)  loc_left(loc, range)  loc_right(loc, loc)  loc_right(loc, range)
loc_lt(loc, loc)  loc_lt(loc, range)  loc_le(loc, loc)  loc_le(loc, range)
loc_gt(loc, loc)  loc_gt(loc, range)  loc_ge(loc, loc)  loc_ge(loc, range)
loc_contains(loc, loc)  loc_contains(loc, range)  loc_contained(loc, loc)  loc_contained(range, loc)
loc_overlaps(loc, loc)  loc_overlaps(loc, range)  loc_meets(loc, loc)  loc_meets(loc, range)
loc_met_by(loc, loc)  loc_met_by(loc, range)  loc_starts(loc, loc)  loc_starts(loc, range)
loc_started_by(loc, loc)  loc_started_by(loc, range)  loc_finishes(loc, loc)  loc_finishes(loc, range)
loc_finished_by(loc, loc)  loc_finished_by(loc, range)
loc_minus(loc, loc)  loc_minus(loc, integer)  loc_minus(loc, range)
loc_plus(loc, loc)  loc_plus(loc, integer)  loc_plus(integer, loc)  loc_plus(loc, range)
loc_times(loc, loc)  loc_times(loc, integer)  loc_times(integer, loc)  loc_times(loc, range)
toloc(cstring, int8, int8)  toloc(text, int8, int8)  toloc(cstring, range)  toloc(text, range)
loc_maxmin(loc, loc)  loc_minmax(loc, loc)  loc_extend(loc, int4)  loc_maxmin(loc, range)  loc_minmax(loc, range)

Aggregate functions
coalescing(text, text, text, int4)  coalescing(text, text, text)
partition(text, text, text, text, text)

revcom(text)  transcribe(text)  translate(text)  translate(text, int4)

range_agg_state(range[], range)  range_agg_final_array(range[])  range_array_aggregate  range_array_enum(range[])
loc_agg_state(loc[], loc)  loc_agg_final_array(loc[])  loc_array_aggregate  loc_array_enum(loc[])

rangeset(range)  rcount(_range)
sort(_range, text)  sort(_range)  sort_asc(_range)  sort_desc(_range)  uniq(_range)
idx(_range, range)  subarray(_range, int4, int4)  subarray(_range, int4)

_range_concat(_range, _range)  _range_overlaps(_range, _range)  _range_contains(_range, _range)
_range_contained(_range, _range)  _range_eq(_range, _range)  _range_ne(_range, _range)
_range_union(_range, _range)  _range_inters(_range, _range)  _range_push_elem(_range, RANGE)
_range_push_array(_range, _range)  _range_del_elem(_range, RANGE)  _range_union_elem(_range, RANGE)
_range_subtract(_range, _range)  _range_contains(_range, range)
_range_contains_interval_any(_range, range)  _range_contains_interval_all(_range, range)
_range_contained_interval_any(_range, range)  _range_contained_interval_all(_range, range)
_range_overlaps_interval_any(_range, range)  _range_overlaps_interval_all(_range, range)
_range_in(cstring)  _range_out(_range_key)

BLAST-related functions
fasta_sequence(text, text, int, int)  fasta_sequence(text, text)  fasta_dbinfo(text)
remote_blast_hit(text, text, text)  remote_blast_hit(text, text)  remote_blast_hit(text)
remote_blast_hsp(text, text, text)  remote_blast_hsp(text, text)  remote_blast_hsp(text)
local_blast_hsp(text, text, text, text)  local_blast_hsp(text, text, text)
local_blast_hit(text, text, text, text)  local_blast_hit(text, text, text)
local_ublast_hit(text, text, text)  local_ublast_hit(text, text)  local_ublast_hit(text)
genbank_search(text)

Page 19: Aligning protein features & exon structure

    CREATE TABLE exon_feature AS
    SELECT DISTINCT
        e.gene,
        e.exon_no as exon,
        f.protein_id as protein,
        f.interval,
        f.type,
        f.description
    FROM
        gene g,
        exon e,
        mrna m,
        protein p,
        protein_feature f
    WHERE
        e.gene = g.gene
        AND e.gene = m.gene
        AND e.gene = p.gene
        AND p.protein_id = f.protein_id
        AND loc_range(e.coding_mrna) + '-2..2'::range
              - range_lower(loc_range(m.coding_mrna))
            @  -- contains
            (f.interval - 1) * 3;

[Result table exon_feature: the same 17 rows for H2-DMb2 (P35737) and HLA-DMB (P28068) as on Page 4]

BLASTgres query operators:

    X && Y == range_overlaps(X,Y)
    X @ Y  == range_contains(X,Y)
    X * Y  == range_times(X,Y)
    X + Y  == range_plus(X,Y)
    etc…
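The last condition of the query can be traced by hand with a small model of the range operators. The Python sketch below is an illustration under stated assumptions, not the BLASTgres implementation: it assumes closed integer ranges, that `range + range` adds the endpoints pairwise, that `range - int` shifts both endpoints, and that `range * int` scales both endpoints. The class `Range`, the function `exon_covers_feature`, and all coordinates in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    lower: int
    upper: int

    def __add__(self, other):
        # Assumed semantics: range + range adds endpoints pairwise,
        # range + int shifts both endpoints.
        if isinstance(other, Range):
            return Range(self.lower + other.lower, self.upper + other.upper)
        return Range(self.lower + other, self.upper + other)

    def __sub__(self, offset: int):
        # Assumed semantics: range - int shifts both endpoints down.
        return Range(self.lower - offset, self.upper - offset)

    def __mul__(self, factor: int):
        # Assumed semantics: range * int scales both endpoints.
        return Range(self.lower * factor, self.upper * factor)

    def contains(self, other: "Range") -> bool:
        # Counterpart of "X @ Y" on the slide (closed intervals assumed).
        return self.lower <= other.lower and other.upper <= self.upper

    def overlaps(self, other: "Range") -> bool:
        # Counterpart of "X && Y" on the slide (closed intervals assumed).
        return self.lower <= other.upper and other.lower <= self.upper


def exon_covers_feature(exon_mrna: Range, cds_start: int, feature: Range) -> bool:
    """Mirror of the query's last condition.

    exon_mrna: exon coordinates on the mRNA
    cds_start: lower bound of the coding region on the mRNA
    feature:   protein feature interval, in residues
    """
    # Widen the exon by two bases on each side, then shift to coding coordinates.
    window = exon_mrna + Range(-2, 2) - cds_start
    # Convert residue positions to nucleotide offsets: (n - 1) * 3.
    return window.contains((feature - 1) * 3)
```

For example, with a hypothetical exon at 400..700 on the mRNA and a coding region starting at 100, the window becomes 298..602; a feature at residue 110..110 maps to 327..327 and is contained, while a feature at 300..310 maps to 897..927 and is not.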

Page 20: Indexing for the loc & range datatypes

    A
    ├─ B  (1,2)(4,160)
    │   ├─ D  (1,1)(4,83):    1(4,70), 1(32,83)
    │   └─ E  (1,2)(40,160):  1(53,160), 2(40,93)
    └─ C  (3,4)(21,203)
        ├─ F  (3,4)(21,78):   3(40,46), 4(21,78)
        └─ G  (4,4)(56,203):  4(160,203), 4(56,120)

    query key: 4(60,80)
    predicate: overlaps

Search keys are represented as two bounding intervals (for leaf nodes, the lower bound is equal to the upper bound for the first bounding interval). The bounding interval of an internal node contains those of its subnodes.
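The overlap search over such bounding-interval keys can be sketched in a few lines of Python. This is an illustration only, not the BLASTgres index code: the tree layout is one reading of the slide's figure (which leaf sits under which node is inferred from containment), the root's own key (1,4)(4,203) is computed here rather than shown on the slide, and `Node`, `leaf`, `overlaps` and `search` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A key is two bounding intervals: (seq_lo, seq_hi) and (pos_lo, pos_hi).
Key = Tuple[int, int, int, int]


def overlaps(a: Key, b: Key) -> bool:
    """True when both the sequence-id intervals and the position intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


@dataclass
class Node:
    key: Key
    children: List["Node"] = field(default_factory=list)


def leaf(seq: int, lo: int, hi: int) -> Node:
    # For a leaf, the first bounding interval collapses to a single sequence id.
    return Node((seq, seq, lo, hi))


def search(node: Node, query: Key) -> List[Key]:
    """Return leaf keys overlapping the query, pruning non-overlapping subtrees."""
    if not overlaps(node.key, query):
        return []
    if not node.children:
        return [node.key]
    hits: List[Key] = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


# Tree as read from the figure (assumed layout).
D = Node((1, 1, 4, 83), [leaf(1, 4, 70), leaf(1, 32, 83)])
E = Node((1, 2, 40, 160), [leaf(1, 53, 160), leaf(2, 40, 93)])
F = Node((3, 4, 21, 78), [leaf(3, 40, 46), leaf(4, 21, 78)])
G = Node((4, 4, 56, 203), [leaf(4, 160, 203), leaf(4, 56, 120)])
B = Node((1, 2, 4, 160), [D, E])
C = Node((3, 4, 21, 203), [F, G])
A = Node((1, 4, 4, 203), [B, C])
```

With this layout, `search(A, (4, 4, 60, 80))` returns the leaf keys 4(21,78) and 4(56,120); the whole subtree under B is skipped because its sequence-id interval (1,2) cannot contain sequence 4.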


GiST Indexing in BLASTgres

GiST (Generalized Search Tree) is the PostgreSQL framework for user-defined indexing.

BLASTgres uses GiST indexing for locations, ranges, location arrays, and range arrays.

[Figure: the example index tree with nodes A to G and query key 4(60,80), repeated from the previous slide]

GiST indexing requires user definition of four methods:

• Consistent: determine if a subtree traversal is necessary.
• Union: merge two nodes.
• Penalty: determine the cost of inserting an entry in a node.
• PickSplit: determine how to split a full node.
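The four methods listed above can be sketched for plain one-dimensional intervals. This Python version is a simplified illustration under stated assumptions, not the BLASTgres code: PostgreSQL's actual GiST interface is written in C and has additional support functions, the penalty here is simply the growth in interval length, and the split rule (sort by lower bound, cut in half) is a deliberately naive heuristic. All function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (lo, hi)


def consistent(key: Interval, query: Interval) -> bool:
    """Consistent: can anything under `key` overlap `query`?

    A False answer lets the search skip the whole subtree.
    """
    return key[0] <= query[1] and query[0] <= key[1]


def union(keys: List[Interval]) -> Interval:
    """Union: smallest interval bounding all the given keys."""
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def penalty(key: Interval, new: Interval) -> int:
    """Penalty: how much `key` must grow to also cover `new`."""
    grown = union([key, new])
    return (grown[1] - grown[0]) - (key[1] - key[0])


def pick_split(keys: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """PickSplit: divide the entries of a full node into two groups.

    Naive rule (an assumption): order by lower bound and cut in the middle.
    """
    ordered = sorted(keys)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


print(consistent((4, 83), (60, 80)))       # True
print(union([(4, 70), (32, 83)]))          # (4, 83)
print(penalty((4, 83), (60, 80)))          # 0
print(penalty((160, 203), (60, 80)))       # 100
print(pick_split([(53, 160), (4, 70), (40, 93), (32, 83)]))
```

On insertion, the tree would descend into the child with the lowest penalty, so the new interval (60,80) would go under the key (4,83), which already covers it, rather than under (160,203).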


BioPostgres Modules

BLASTgres: Biosequence data analysis
GObase: GeneOntology (GO) analysis
PostGraph: Graph database extensions
PostModel: Model base/data mining extensions
PostMake: Derivation dependency extensions

Many modules are needed! We have only just begun to develop modules we feel are most clearly needed. Please contact us if you have opinions or ideas.


Quick overview: PostGraph

PostGraph -- extensions for graph management

• Graph datatype: each graph is a single data object
• Graph operators: insert edge, delete node, etc.
• Graph query: find connected components, etc.

Access to graph tools: dot/graphviz, prefuse, etc.

URL: http://www.biopostgres.org/PostGraph/


GeneOntology relational database

• GO is naturally a graph, distributed as relational tables
• This representation is fine for storage but can be awkward for exploration

SELECT id, name, term_type, acc FROM term LIMIT 7;

 id | name                          | term_type          | acc
----+-------------------------------+--------------------+------------
  1 | all                           | universal          | all
  2 | is_a                          | relationship       | is_a
  3 | part_of                       | gene_ontology      | part_of
  4 | mitochondrion inheritance     | biological_process | GO:0000001
  5 | mitochondrial genome maintena | biological_process | GO:0000002
  6 | reproduction                  | biological_process | GO:0000003
  7 | alt_id                        | synonym_type       | alt_id

SELECT * FROM term2term LIMIT 7;

 id | relationship_type_id | term1_id | term2_id | complete
----+----------------------+----------+----------+----------
  1 |                    2 |       10 |        9 |        0
  2 |                    2 |       10 |       13 |        0
  3 |                    2 |       26 |       25 |        0
  4 |                    2 |       10 |       42 |        0
  5 |                    2 |       10 |       50 |        0
  6 |                    2 |       26 |       67 |        0
  7 |                    2 |       26 |       92 |        0
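To see why the edge-table form is awkward for exploration, consider even a simple graph question such as finding connected components, which needs an iterative traversal rather than a single join. The Python sketch below runs that traversal over the seven term2term rows shown above, treating each (term1_id, term2_id) pair as an undirected edge. It is an illustration only and does not use PostGraph; the function name is hypothetical.

```python
from collections import defaultdict, deque

# (term1_id, term2_id) pairs copied from the seven term2term rows shown above.
edges = [(10, 9), (10, 13), (26, 25), (10, 42), (10, 50), (26, 67), (26, 92)]


def connected_components(edge_list):
    """Group term ids into components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search from each unvisited term.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


components = connected_components(edges)
print(components)  # [[9, 10, 13, 42, 50], [25, 26, 67, 92]]
```

With only these seven sample rows, the terms fall into two groups, one around term 10 and one around term 26; the full GO tables would of course link many more terms.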


Augmenting the GeneOntology database with a graph database

• PostGraph permits GO terms to be stored & queried as graphs
• GOtermViewer uses this to provide a Graphical Interface for GO
• These are extended in GObase: www.biopostgres.org/GObase/


Quick overview: PostMake

PostMake -- extensions for managing data dependencies

PostMake covers common dependencies in bioscience:
• PostCron: database operations at pre-specified times ("crontab within a database")
• Database dump/load, tracking when things change
• Materialized views, triggered by updates ("make within a database")

PostMakefiles are translated to SQL DDL

URL: http://www.biopostgres.org/PostMake/
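The "make within a database" idea can be sketched as a dependency graph over tables: when a base table changes, every derived table downstream of it is rebuilt, in dependency order. The Python sketch below is a hedged illustration of that idea, not of PostMake itself or of the PostMakefile format. The exon_feature entry reuses the tables from the earlier query; `exon_feature_summary` and the `rebuild_plan` function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Derived table -> tables it is built from. exon_feature and its inputs come
# from the earlier query; exon_feature_summary is a hypothetical second view.
depends_on = {
    "exon_feature": {"gene", "exon", "mrna", "protein", "protein_feature"},
    "exon_feature_summary": {"exon_feature"},
}


def rebuild_plan(changed, dependencies):
    """List the derived tables to rebuild, inputs before dependents."""
    stale = set(changed)
    plan = []
    # static_order() yields every table after all of the tables it needs.
    for table in TopologicalSorter(dependencies).static_order():
        inputs = dependencies.get(table, set())
        if inputs & stale:
            stale.add(table)   # a rebuilt table makes its dependents stale
            plan.append(table)
    return plan


print(rebuild_plan({"exon"}, depends_on))
# ['exon_feature', 'exon_feature_summary']
print(rebuild_plan({"term"}, depends_on))
# []
```

As with make, a change to a table that nothing depends on produces an empty plan, so unrelated updates cost nothing.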


BioPostgres and open-source

Like PostgreSQL, BioPostgres is committed to openness

BioPostgres modules are GPL'ed; most code so far is in C, SQL, and Java

BioPostgres modules are downloadable from SourceForge

We think BioPostgres shows that extensible DBMS have a lot to offer the open-source movement, particularly in bioscience.


Will DBMS ever get an A in Biology?

Actually we don't know.

We do not think DBMS will ever replace files or programming languages in bioscience.

We think however that scalability is a driving issue, and it will impact the effectiveness of files and programming languages in bioscience. And soon.

It does appear that extensible DBMS can suggest steps in the right direction.


THANK YOU!