BioSQL Reloaded: v1.0 Release, PhyloDB Module,
and Future Features
Hilmar Lapp (NESCent), Richard Holland, Aaron Mackey, William Piel, Mark Schreiber
BOSC 2008
What is BioSQL?
Interoperable persistence layer for Bio* supporting
• BioPerl
• Biojava
• Biopython
• BioRuby
What is BioSQL?
Generic & extensible relational model
• sequences
• features
• sequence and feature annotation
• a reference taxonomy
• ontologies, controlled vocabularies
• phylogenetic trees or networks
A Brief History
• Ewan Birney started BioSQL and Bioperl-db in Nov 2001
• Major redesigns and refactorings at several BioHackathons in 2002-2003
• PhyloDB module added at 2006 Phyloinformatics Hackathon
• Reinvigorated at 2008 BioHackathon
• v1.0 released in March 2008
Use Cases
1) Local ‘GenBank’ with random access
2) ‘GenBank’ in relational format
3) Interoperable Bio* persistence
4) My lab sequence database
5) Integrate sequence & annotation databases
• Website: http://biosql.org
• Mailing list: [email protected]
• Subversion: svn://code.open-bio.org/biosql/biosql-schema
• Bugs: http://bugzilla.open-bio.org
BioSQL Relational Model
BioSQL 1.0 entity-relationship diagram. Table groups shown in the diagram: Annotation Bundle; Bioentry with Taxon and Namespace; Seqfeatures with Location and Annotation; Ontology Terms and Relationships.
Tables and their columns (FK = foreign key):
• Biodatabase: Biodatabase Id, Name, Authority, Description
• Taxon: Taxon Id, Ncbi Taxon Id, Parent Taxon Id (FK), Node Rank, Genetic Code, Mito Genetic Code, Left Value, Right Value
• Taxon Name: Taxon Id (FK), Name, Name Class
• Ontology: Ontology Id, Name, Definition
• Term: Term Id, Name, Definition, Identifier, Is Obsolete, Ontology Id (FK)
• Term Synonym: Term Id (FK), Synonym
• Term Dbxref: Term Id (FK), Dbxref Id (FK), Rank
• Term Relationship: Term Relationship Id, Subject Term Id (FK), Predicate Term Id (FK), Object Term Id (FK), Ontology Id (FK)
• Term Path: Term Path Id, Subject Term Id (FK), Predicate Term Id (FK), Object Term Id (FK), Ontology Id (FK), Distance
• Bioentry: Bioentry Id, Biodatabase Id (FK), Taxon Id (FK), Name, Accession, Identifier, Division, Description, Version
• Bioentry Relationship: Bioentry Relationship Id, Object Bioentry Id (FK), Subject Bioentry Id (FK), Term Id (FK), Rank
• Bioentry Path: Object Bioentry Id (FK), Subject Bioentry Id (FK), Term Id (FK), Distance
• Biosequence: Bioentry Id (FK), Alphabet, Version, Length, Seq
• Dbxref: Dbxref Id, Dbname, Accession, Version
• Dbxref Qualifier Value: Dbxref Id (FK), Term Id (FK), Rank, Value
• Bioentry Dbxref: Bioentry Id (FK), Dbxref Id (FK), Rank
• Reference: Reference Id, Dbxref Id (FK), Location, Title, Authors, Crc
• Bioentry Reference: Bioentry Id (FK), Reference Id (FK), Rank, Start Pos, End Pos
• Comment: Comment Id, Bioentry Id (FK), Comment Text, Rank
• Bioentry Qualifier Value: Bioentry Id (FK), Term Id (FK), Value, Rank
• Seqfeature: Seqfeature Id, Bioentry Id (FK), Type Term Id (FK), Source Term Id (FK), Display Name, Rank
• Seqfeature Relationship: Seqfeature Relationship Id, Object Seqfeature Id (FK), Subject Seqfeature Id (FK), Term Id (FK), Rank
• Seqfeature Path: Object Seqfeature Id (FK), Subject Seqfeature Id (FK), Term Id (FK), Distance
• Seqfeature Qualifier Value: Seqfeature Id (FK), Term Id (FK), Rank, Value
• Seqfeature Dbxref: Seqfeature Id (FK), Dbxref Id (FK), Rank
• Location: Location Id, Seqfeature Id (FK), Dbxref Id (FK), Term Id (FK), Start Pos, End Pos, Strand, Rank
• Location Qualifier Value: Location Id (FK), Term Id (FK), Value, Int Value
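To make the relational model more concrete, here is a minimal sketch of how a sequence might be looked up by namespace and accession across the Biodatabase, Bioentry, and Biosequence tables. It uses an in-memory SQLite database from Python's standard library. The column types, the reduced column set, the helper names build_demo_db and fetch_sequence, and the DEMO_0001 record are all assumptions made for illustration; this is not the released DDL.

```python
import sqlite3

# Simplified subset of three tables from the diagram. Column types and
# constraints are illustrative assumptions, not the released BioSQL DDL.
SCHEMA = """
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE,
    authority      TEXT,
    description    TEXT
);
CREATE TABLE bioentry (
    bioentry_id    INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase(biodatabase_id),
    name           TEXT NOT NULL,
    accession      TEXT NOT NULL,
    version        INTEGER NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet    TEXT,
    length      INTEGER,
    seq         TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO biodatabase (biodatabase_id, name) VALUES (1, 'RefSeq')"
    )
    conn.execute(
        "INSERT INTO bioentry (bioentry_id, biodatabase_id, name, accession, version) "
        "VALUES (1, 1, 'demo_entry', 'DEMO_0001', 1)"
    )
    conn.execute(
        "INSERT INTO biosequence (bioentry_id, alphabet, length, seq) "
        "VALUES (1, 'dna', 8, 'ACGTACGT')"
    )
    return conn


def fetch_sequence(conn, namespace, accession):
    """Return the sequence string for (namespace, accession), or None."""
    row = conn.execute(
        """
        SELECT s.seq
        FROM bioentry e
        JOIN biodatabase d ON d.biodatabase_id = e.biodatabase_id
        JOIN biosequence s ON s.bioentry_id = e.bioentry_id
        WHERE d.name = ? AND e.accession = ?
        """,
        (namespace, accession),
    ).fetchone()
    return row[0] if row else None
```

The namespace lives in Biodatabase and the sequence in Biosequence, so even this simple lookup joins three tables. The language bindings hide that join behind object adaptors.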
Loading & updating the NCBI Taxonomy
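The Taxon table carries Left Value and Right Value columns next to Parent Taxon Id, which is the usual layout for a nested-set encoding of a tree. The sketch below shows how such values could be assigned by a depth-first walk and then used to fetch a whole subtree with one range query. It is only an illustration under stated assumptions: the table is reduced to a few columns, the name is stored directly on the taxon row (the schema keeps names in Taxon Name), the taxon IDs are toy values rather than NCBI IDs, and the functions assign_nested_set and descendants are hypothetical helpers, not part of load_ncbi_taxonomy.pl.

```python
import sqlite3


def make_toy_taxonomy():
    """Build a five-node toy taxonomy (IDs are made up, not NCBI IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE taxon (
            taxon_id        INTEGER PRIMARY KEY,
            parent_taxon_id INTEGER,
            name            TEXT,      -- simplification: real names live in taxon_name
            left_value      INTEGER,
            right_value     INTEGER
        )
        """
    )
    conn.executemany(
        "INSERT INTO taxon (taxon_id, parent_taxon_id, name) VALUES (?, ?, ?)",
        [
            (1, None, "Mammalia"),
            (2, 1, "Primates"),
            (3, 1, "Rodentia"),
            (4, 2, "Homo"),
            (5, 2, "Pan"),
        ],
    )
    return conn


def assign_nested_set(conn, root_id):
    """Number every taxon below root_id with nested-set left/right values."""
    counter = 0

    def visit(taxon_id):
        nonlocal counter
        counter += 1
        left = counter
        children = conn.execute(
            "SELECT taxon_id FROM taxon WHERE parent_taxon_id = ? ORDER BY taxon_id",
            (taxon_id,),
        ).fetchall()
        for (child_id,) in children:
            visit(child_id)
        counter += 1
        conn.execute(
            "UPDATE taxon SET left_value = ?, right_value = ? WHERE taxon_id = ?",
            (left, counter, taxon_id),
        )

    visit(root_id)


def descendants(conn, taxon_id):
    """Return names of all taxa nested strictly inside the given taxon."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM taxon c
        JOIN taxon p
          ON c.left_value > p.left_value AND c.right_value < p.right_value
        WHERE p.taxon_id = ?
        ORDER BY c.left_value
        """,
        (taxon_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The trade-off is that a subtree query needs no recursion, but the left/right values have to be recomputed when the tree changes, which would explain why loading and updating the taxonomy is treated as a separate step.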
Language Bindings
Bioperl-db
• Step 1: connect, get adaptor factory
    use Bio::DB::BioDB;
    # create the database-specific adaptor factory
    # (implements Bio::DB::DBAdaptorI)
    $db = Bio::DB::BioDB->new(-database => "biosql",
                              # user, pwd, driver, host …
                              -dbcontext => $dbc);
Bioperl-db
• Step 2: e.g., retrieve sequence, add annotation, update in the db
    use Bio::Seq;
    use Bio::SeqFeature::Generic;
    # retrieve the sequence object somehow …
    $adp = $db->get_object_adaptor("Bio::SeqI");
    $dbseq = $adp->find_by_unique_key(
        Bio::Seq->new(-accession_number => "NM_000149",
                      -namespace => "RefSeq"));
    # create a feature as new annotation
    $feat = Bio::SeqFeature::Generic->new(
        -primary_tag => "TFBS",
        -source_tag => "My Lab",
        -start => 23, -end => 27, -strand => -1);
    # add new annotation to the sequence
    $dbseq->add_SeqFeature($feat);
    # update in the database
    $dbseq->store();
Tools for data loading: Sequences
• load_seqdatabase.pl (in Bioperl-db)
• All Bio::SeqIO and Bio::ClusterIO formats
• Flexible handling of updates
• --lookup, --noupdate, --remove, --mergeobjs
• Filtering and processing sequences
• --seqfilter, --pipeline
Bindings for most Bio* projects
• BioPerl (Bioperl-db)
• Biojava (BiojavaX)
• Biopython
• BioRuby (Active Objects-based)
• All updated at the 2008 BioHackathon
v1.0 Release (Tokyo)
• Core BioSQL schema (stable since Nov 2004)
• DDL for MySQL, PostgreSQL, Oracle, HSQLDB, Apache Derby
• Ancillary (but optional) files for PostgreSQL
• Documentation and ERD
• load_ncbi_taxonomy.pl
• Now LGPL v3.0 licensed
Download at http://biosql.org/DIST
Tools for data loading: Ontologies
• load_ontology.pl (in Bioperl-db)
• All Bio::OntologyIO formats
• Additional options for term obsoletion
• --noobsolete, --updobsolete, --delobsolete, --mergeobjs
• (Re-)computing the transitive closure
• --computetc
• Set in motion at the 2008 BioHackathon in Tokyo
• Special thanks to Heikki Lehväslaiho, Mark Schreiber, Richard Holland, and Raoul Bonnal
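The ontology loader's option to (re-)compute the transitive closure relates to the Term Path table, which stores subject, predicate, and object terms together with a Distance. As a rough illustration of what such a closure contains, the sketch below derives path rows from direct relationship triples with a breadth-first search. It is an assumption-laden stand-in, not the algorithm load_ontology.pl uses: the function name transitive_closure is hypothetical, paths only follow a single predicate, the shortest distance is recorded, and reflexive zero-distance rows are left out.

```python
from collections import deque


def transitive_closure(relationships):
    """Expand direct term relationships into path rows with distances.

    relationships: iterable of (subject, predicate, object) triples.
    Returns a set of (subject, predicate, object, distance) tuples, one for
    every term reachable from the subject by following the same predicate.
    """
    # Index the direct edges: (subject, predicate) -> set of objects.
    direct = {}
    for subj, pred, obj in relationships:
        direct.setdefault((subj, pred), set()).add(obj)

    paths = set()
    for subj, pred in list(direct):
        distance_to = {subj: 0}          # also guards against cycles
        queue = deque([subj])
        while queue:
            current = queue.popleft()
            for nxt in direct.get((current, pred), ()):
                if nxt not in distance_to:
                    distance_to[nxt] = distance_to[current] + 1
                    paths.add((subj, pred, nxt, distance_to[nxt]))
                    queue.append(nxt)
    return paths
```

With the closure precomputed, a question such as "all terms that are a kind of X, directly or indirectly" becomes a single lookup on the path rows instead of a recursive walk over the direct relationships.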
PhyloDB Module
• Phylogenetic trees (or networks)
• Metadata for trees, nodes, edges
• Attribute-value pairs
• Database cross-references
• Can attach taxa or genes to nodes
PhyloDB History
• Started at NESCent Phyloinformatics Hackathon 2006 with Bill Piel
• Expanded metadata capabilities at BioHackathon 2008 (Tokyo)
• Separate, optional module
• Not released yet, still in development
PhyloDB schema diagram, tables with the attributes shown:
• Tree: Name, Identifier, Is_Rooted
• Node: Label, Left_Idx, Right_Idx
• Edge
• Node_Path: Distance
• Tree_Root: Is_Alternate, Significance
• Tree_Qualifier_Value: Value, Rank
• Node_Qualifier_Value: Value, Rank
• Edge_Qualifier_Value: Value, Rank
• Tree_Dbxref
• Node_Dbxref
• Node_Taxon: Rank
• Node_Bioentry: Rank
• Linked core BioSQL tables: Biodatabase, Term, Taxon, Bioentry, Ontology, Dbxref
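The PhyloDB tables include Node, Edge, and a Node_Path table with a distance attribute, which suggests that ancestor and descendant pairs are stored explicitly so that topological queries do not need recursion. The following sketch shows one such query, finding the most recent common ancestor of two nodes, using plain Python dictionaries in place of the tables. The function names build_node_path and most_recent_common_ancestor, the dictionary layout, and the toy tree are assumptions for illustration only.

```python
def build_node_path(parent_of):
    """Derive (ancestor, descendant) -> distance pairs from child -> parent links.

    parent_of maps each node to its parent; the root maps to None.
    """
    paths = {}
    for node in parent_of:
        distance = 0
        current = node
        # Walk up to the root, recording every ancestor on the way.
        while parent_of.get(current) is not None:
            current = parent_of[current]
            distance += 1
            paths[(current, node)] = distance
    return paths


def most_recent_common_ancestor(paths, a, b):
    """Return the common ancestor closest to both a and b, or None."""
    ancestors_a = {anc: d for (anc, desc), d in paths.items() if desc == a}
    ancestors_b = {anc: d for (anc, desc), d in paths.items() if desc == b}
    # Treat each node as its own ancestor at distance 0.
    ancestors_a[a] = 0
    ancestors_b[b] = 0
    common = set(ancestors_a) & set(ancestors_b)
    if not common:
        return None
    return min(common, key=lambda n: ancestors_a[n] + ancestors_b[n])
```

For a toy tree in which root has children n1 and C, and n1 has children A and B, the most recent common ancestor of A and B is n1, and that of A and C is root.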
Tools for loading data
• James Estill (U. Georgia): “A Perl-based Command Line Interface to a Topological Query Application for BioSQL in Support of High Throughput Classification and Analysis of LTR Retrotransposons in Plant Genomes”
What can you use BioSQL for?
Data Integration
Architecture diagram, components as labeled:
• SymGene (Oracle 9i)
• Content Synthesis: genome mappings, relationship harvest
• Integrated sources: Ensembl, Celera, LocusLink, RefSeq, UniGene, OMIM, GNF1B, U133A
• Bioperl/Bioperl-DB
• SQL API (Views, PL/SQL)
• J2EE API
• Genome Browser
• SymAtlas Web Application (JSPs)
• Published Web Services
• Rich Client App
Platonic gene graphs
Identifiers linked in the diagram:
• 6850 (LocusLink)
• GNF055813 (GNF cDNA clones)
• hCG29698 (Celera)
• P43405 (UniProt)
• NP_003168 (RefSeq)
• NM_003177 (RefSeq)
• hCT1962558 (Celera)
• hCT20865 (Celera)
• 207540_s_at (HG-U133A)
• Hs.192182 (UniGene)
• ENSG00000165025 (Ensembl)
• ENST00000297685 (Ensembl)
• 36885_s_at (HG_U95Av2)
• Mostly used as a module with custom extensions
• Squares away sequence annotation and ontologies
Summary
• BioSQL has benefited tremendously from hackathons.
• The v1.0 release makes it possible to chart the way forward.
• The PhyloDB module allows cross-project persistence of phylogenetic data.
• Use cases range from simple to very complex.
Acknowledgments
• Bio* contributors:
Aaron Mackey, Ewan Birney, Thomas Down, Matthew Pocock, Mark Schreiber, Richard Holland, Elia Stupka, Chris Mungall, Brad Chapman, Jeff Chang, Toshiaki Katayama
• Hackathon sponsors:
• DBCLS/CBRC (Tokyo 2008)
• NESCent (Durham 2006)
• Apple (Singapore 2003)
• O’Reilly (Tucson 2002)
• Electric Genetics (Cape Town 2002)