CIPRES Database Focus Group

NSF Site Visit

June 28, 2006

San Diego

Senior Personnel

• Susan Davidson, University of Pennsylvania

• Michael Donoghue, Yale University

• Mark Miller, San Diego Supercomputer Center

• Dan Miranker, UT Austin

• Brent Mishler, UC Berkeley

• William H. Piel, Yale University (TreeBASE II lead)

• Val Tannen, University of Pennsylvania (database focus lead)

Other (Partially) Funded Personnel

• Lucie Chan, Senior Software Developer, San Diego Supercomputer Center

• Shirley Cohen, Database Developer, then PhD Student, UT Austin, then University of Pennsylvania

• Sarah Cohen-Boulakia, Post-Doc, University of Pennsylvania (not funded by CIPRES)

• Jin Ruan, Senior Software Developer, San Diego Supercomputer Center (TreeBASE II Software Lead)

• Yifeng Zheng, PhD student, University of Pennsylvania.

Goals of the Database Focus

• The major objective is the development of TreeBASE II

• In addition, this focus has supported related research on

– storage/querying of the large phylogenetic trees constructed in

• the Simulation Focus (Davidson, Kim, Zheng)

• the Algorithms Focus of the project (Moret, Hunt, Warnow)

– data provenance in phyloinformatics workflows (Davidson, Cohen, Cohen-Boulakia)

– phylogenetic database extensions using a metric ordering to support molecular data (Miranker)

– genome-scale phylogenetics (Piel)

– searching large collections of trees for topological patterns (Piel)

The current TreeBASE (I)

• A 10+ year-old major data resource for biological and biomedical research

– submissions needed to be published in a peer-reviewed scientific journal before being published in TreeBASE

• Has been searched from over 60,000 distinct IP addresses

• Has accepted over 1,300 submissions that map to over

– 3,700 trees and

– 60,000 distinct taxa

• But the capabilities of the current database are being overtaken by demands.

• CIPRES is developing TreeBASE II as a robust, scalable, and versatile re-design and re-engineering of TreeBASE I.

TreeBASE I Audience

Researchers from

– traditional systematics backgrounds and

– molecular biology backgrounds

who are concentrating on a series of focused experiments in the lab.

These users include those who periodically seek online representations of individual phylogenies for research and educational purposes.

Additional TreeBASE II Audiences (1)

Researchers that want to run meta-analyses on large collections of trees. Examples:

• identifying patterns in trees that result from one type of analysis over another

• visualizing large collections of trees

• studying collaborative networks among phylogeneticists

Additional TreeBASE II Audiences (2)

Phyloinformaticians who seek to make large-scale inference using synthetic methods applied to large collections of trees. Examples:

• assemble a supertree for a large branch of the Tree of Life

• mine data in search of conflicting phylogenetic signals

• examine the evolution of genes and genomes in a comparative context

Additional TreeBASE II Audiences (3)

Bioinformaticians who conduct simulation studies.

Frequently, simulation studies use simple models, such as Kimura 2-Parameter and Jukes-Cantor, that are not believed to be biologically realistic.

Finding realistic evolutionary models, using real data, and carrying out simulation studies are some of the main goals of this group.

Value Added by TreeBASE II

• A phylogenetic query language to allow "power users" to run complex phyloinformatic queries, including on tree topology.

• A robust service layer and LSIDs to allow external tools and services to interface with the database.

• Storage of LSIDs and foreign handles to better integrate with external data services (morphological characters, gene names, taxon names, and museum specimen IDs).

• Taxonomic intelligence for leaf and node labels.

• Ability to store geographic coordinates to support phylogeographic data visualization and analysis.

Collected Use Cases: Query Examples

• Given a set of taxa and a character matrix, find the characters for which the taxa have the same state.

• Given a set of taxa and a set of trees, find all trees for which the subtree determined by the taxa (as leaves) is the same.
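The two example queries can be sketched in a few lines of Python. This is only an illustration of what the queries compute, not the TreeBASE II query language: the in-memory representations (a matrix as a dict from taxon to a list of states, a rooted tree as nested tuples with unique leaf labels) and all function names are assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Union

# Hypothetical representations chosen for this sketch only.
Matrix = Dict[str, List[str]]       # taxon -> state of each character
Tree = Union[str, tuple]            # leaf label, or tuple of child subtrees
Canon = Union[str, FrozenSet]       # order-independent form of a subtree


def shared_state_characters(matrix: Matrix, taxa: Iterable[str]) -> List[int]:
    """Indices of the characters on which all the given taxa have the same state."""
    rows = [matrix[t] for t in taxa]
    if not rows:
        return []
    return [i for i in range(len(rows[0])) if len({row[i] for row in rows}) == 1]


def induced_subtree(tree: Tree, taxa: FrozenSet[str]) -> Optional[Canon]:
    """Restrict a rooted tree to the given leaves.

    Nodes left with a single child are collapsed. Children are stored in a
    frozenset, so the result does not depend on child order (this assumes
    leaf labels are unique within a tree). Returns None if no leaf survives.
    """
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [s for s in (induced_subtree(child, taxa) for child in tree) if s is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return frozenset(kept)


def group_by_induced_subtree(trees: Dict[str, Tree], taxa: Iterable[str]) -> Dict:
    """Group tree names by the subtree that the given taxa determine in each tree."""
    wanted = frozenset(taxa)
    groups: Dict = {}
    for name, tree in trees.items():
        groups.setdefault(induced_subtree(tree, wanted), []).append(name)
    return groups


matrix = {"A": ["0", "1", "1"], "B": ["0", "0", "1"], "C": ["1", "0", "1"]}
trees = {
    "t1": (("A", "B"), ("C", "D")),
    "t2": ((("A", "B"), "C"), "D"),
    "t3": (("A", "C"), ("B", "D")),
}
agreeing = shared_state_characters(matrix, ["A", "B"])
groups = group_by_induced_subtree(trees, ["A", "B", "C"])
```

Here t1 and t2 fall into the same group because both place A and B together with C as their sister once D is dropped, while t3 groups A with C and therefore lands in a group of its own.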

TreeBASE II Capabilities: Submission

• Friendlier interface, more features semi-automated

• Support for entering additional (currently non-NEXUS) data such as specimen IDs

• Automated annotations (e.g., communication with other sources to retrieve GenBank accession number sequence)

• Better error checking (e.g., matching taxon labels between trees and character matrices)

• Assistance features will be opt-in and can be turned off by the user
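As an illustration of the error checking mentioned above, the following Python sketch compares the leaf labels of a tree with the row labels of a character matrix. It assumes the tree is given as a simple Newick string without quoted labels; the function names and the returned dictionary layout are hypothetical and do not describe the TreeBASE II implementation.

```python
import re
from typing import Dict, Iterable, List, Set


def tree_leaf_labels(newick: str) -> Set[str]:
    """Leaf labels of a simple Newick string (assumes no quoted labels).

    A leaf label directly follows "(" or ",". Internal node labels follow
    ")" and branch lengths follow ":", so neither is picked up.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def label_mismatches(newick: str, matrix_taxa: Iterable[str]) -> Dict[str, List[str]]:
    """Report labels that appear in the tree or in the matrix, but not in both."""
    leaves = tree_leaf_labels(newick)
    rows = set(matrix_taxa)
    return {
        "only_in_tree": sorted(leaves - rows),
        "only_in_matrix": sorted(rows - leaves),
    }


# A misspelled label in the tree shows up on both sides of the report.
report = label_mismatches(
    "((Homo_sapiens:0.1,Pan_troglodytes:0.1):0.2,Gorilla_gorila);",
    ["Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"],
)
```

A submission interface could show such a report to the submitter before the data is accepted, rather than silently storing mismatched labels.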

TreeBASE II Capabilities: Curation

• Support for interaction with the publication process:

– In conjunction with journal submission, study data is submitted to TreeBASE

– It is not made visible to search/query users, but reviewers or journal editors can examine it (anonymous access)

– If and when the journal submission is accepted, the study data is made visible to search/query users

• Support for TreeBASE II editors, examples:

– to correct author, citation, or other metadata

– to correct the taxon names (alignment between trees and character matrices or with taxonomic services)

– to remove orphan data

An interface with access to taxonomic services such as uBio (www.ubio.org) or the Glasgow Name Server (taxonomy.zoology.gla.ac.uk/rod/rod.html) will be provided to facilitate both submission support and curation capability.

TreeBASE II Capabilities: Search (1)

2-step configurable GUI retrieving sets of studies, matrices, or trees.

– Step 1: choose search criteria

– Step 2: choose search

• Study Search By:

– Disjunction of conjunctions of author last names

– Citation title matches given keyword(s)

– Name matches keyword

– Contains analysis/analysis step such that:

• Name matches given keyword(s)

• Uses given algorithm

• Uses given software package

• Input and/or output data contains given set of taxa

• Input and/or output data contains tree that matches given tree pattern

• Input and/or output data contains matrices satisfying given search criteria (same as below)
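The "disjunction of conjunctions of author last names" criterion can be read as follows: a study matches if, for at least one group of names, every name in that group is among its authors. A minimal Python sketch of that reading, with a hypothetical function name and a query encoded as a list of lists (outer list = OR, inner lists = AND):

```python
from typing import Iterable, List


def matches_authors(study_authors: Iterable[str], query: List[List[str]]) -> bool:
    """True if at least one clause has all of its last names among the study's authors.

    Comparison is case-insensitive. An empty query matches nothing.
    """
    names = {a.lower() for a in study_authors}
    return any(all(n.lower() in names for n in clause) for clause in query)


# (Piel AND Donoghue) OR (Mishler)
query = [["Piel", "Donoghue"], ["Mishler"]]
hit_both = matches_authors(["Piel", "Donoghue", "Sanderson"], query)
hit_single = matches_authors(["Mishler"], query)
miss = matches_authors(["Piel", "Sanderson"], query)
```

In a database-backed search the same structure would presumably be translated into an OR of AND conditions over the author table; the in-memory version above only shows the intended semantics.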

TreeBASE II Capabilities: Search (2)

• Tree Search By:

– Tree id number

– Appears in a study satisfying given search criteria (same as above)

– Appears in an analysis/analysis step satisfying given search criteria (same as above)

– Contains given set of taxa

– Matches given tree pattern

• Matrix Search By:

– Uses given set of taxa

– Uses given set of character names

– Is a sequence matrix that uses a certain kind of biomolecular information

– Contains given specimen(s)

TreeBASE II Capabilities: Bulk Queries

XML-based query interface for tools that interoperate with TreeBASE II

• Input: domain-specific query language

– based on the TreeBASE Domain Model

– related semantically to a simple subset of SQL or ODMG/OQL

– XML-based syntax

• TreeBASE XML format for query output

– Nexus data

– additional data in TreeBASE II

• For the CIPRES tool, which is CORBA-based, we will use an IDL-to-XML bridge

• Sophisticated interactive users can also submit prepared queries

TreeBASE II Domain Model

A detailed object-oriented Domain Model was designed for TreeBASE II

(EER diagrams were manually derived from the Domain Model)

A very partial and simplified view:

[Diagram: Study, Data, Matrix, Tree, Taxon, MatrixRow, RowSegment, Specimen]

Technologies used in TreeBASE II development

• Open source

• Proven technologies and best practices

• Hibernate to generate the SQL schema from the Domain Model

• Hibernate, based on the Domain Model, to program any database access

• Tomcat Web container and one of SDSC's Web farms

• Spring framework as an application container to manage transactions

Status and Future Plans

• Requirements and use case collection is complete

• The architectural design is complete

• Currently working on detailed design and coding, including GUI work and loading data from TreeBASE I (some is ready)

• A demo will be performed during the site visit

• TreeBASE I data will be loaded by August 2006

• Elements of the interactive user interface will be beta released and end-user tested throughout Fall 2006

• New submissions accepted starting February 2007

• Links to taxonomic services developed in Spring 2007

• Bulk query API, including CIPRES tool interface, developed in 2007

• Available as Web service at end of 2007
