
HAVING A BLAST: ANALYZING GENE SEQUENCE …jhammer/publications/SAC2003/... · Web viewHaving a Blast: Analyzing Gene Sequence Data with BlastQuest – Where Do We Go from here? Abstract


Paper ID# SACBIO-129

HAVING A BLAST: ANALYZING GENE SEQUENCE DATA WITH BLASTQUEST –

WHERE DO WE GO FROM HERE?

Abstract

In this paper, we pursue two main goals. First, we describe a new tool called BlastQuest, for managing BLAST query results. BlastQuest provides interactive, Web-enabled query, analysis, and visualization facilities beyond what is possible by current BLAST interfaces. Specifically, the BLAST results, which are in XML format, are extracted, structured, and stored persistently in a relational database to support a series of built-in analysis operations that can be used to select, filter, and order data from multiple BLAST results efficiently and without referring to the original result files. In addition, users have the option to interact with the BLAST data through a mask-oriented, non-SQL query interface.

Despite BlastQuest’s recognized benefits for biologists, its functionality is limited in several important ways. The second goal of this paper is to analyze these shortcomings and describe a new concept based on two main pillars: (1) a Genomics Algebra, which provides an extensible set of high-level genomic data types (GDTs) together with a comprehensive collection of appropriate genomic functions, and (2) a Unifying Database, which allows us to integrate and manage the semi-structured contents of publicly available genomic repositories and to transfer these data into GDT values.

1. Introduction

Biologists are nowadays confronted with two main problems, namely the exponentially growing volume

of biological data of high variety, heterogeneity, and semi-structured nature, and the increasing

complexity of biological applications and methods afflicted with an inherent lack of biological

knowledge. As a result, many very important challenges in biology and genomics are now challenges

in computing, especially in advanced information management and algorithmic design.

The currently most widely used and accepted tool for conducting similarity searches on gene

sequences is BLAST (Basic Local Alignment Search Tool) [1]. BLAST comprises a set of similarity

search programs that employ heuristic algorithms and techniques to detect relationships between gene

sequences and rank the computed ‘hits’ statistically. An essential problem for the biologist is currently the

processing and evaluation of BLAST query results, since a BLAST search yields its result exclusively in

a textual format (e.g., ASCII, HTML, XML). This format has the benefit of being application-neutral but

at the same time impedes its direct analysis. In this paper, we describe a new powerful tool, called

BlastQuest, for managing BLAST results stemming from multiple individual queries. This tool provides

the biologist with interactive and Web-enabled query, analysis, and visualization facilities beyond what is

possible by current BLAST interfaces. In particular, BLAST results from multiple queries are imported,


structured, and stored in a relational database to support a series of built-in analysis operations that can be

used to select, filter, group, and order these data efficiently and without referring to the original BLAST

result files. In addition, users have the option to interact with the data through a user-tailored, screen-

mask oriented, non-SQL query interface based at a deeper, hidden level on a well-defined subset of SQL.

Section 2 elaborates on the current, main challenges in genomics and emphasizes the need for tools

capable of processing BLAST results. In Section 3, we describe our BlastQuest system from the system

architecture and user interface perspectives. Section 4 describes desired improvements to BlastQuest and

why new, sophisticated concepts, tools, and non-standard database technology, which altogether should

lead us far beyond BLAST technology, are indispensable in order to advance biological and genomic

research and progress. Finally, Section 5 draws some conclusions.

2. The Challenge of Genomics and Its Effect on Computer Science

Genomics is a biological discipline focused on understanding living organisms at the level of the whole

genome. It goes beyond a gene-by-gene approach and instead takes a global view of the complete genetic

system. Genomic scientists examine the full catalog of genes, the processes that control them, gene

interrelationships and interdependencies, and how the organism responds to changes in environment through

the expression of genetic information. In order to illustrate the challenges faced by scientists in this field,

we first review the most important concepts underlying gene sequencing.

2.1. Gene Sequencing

DNA is an information storage macromolecule that encodes all of the heritable information passed from

generation to generation of living organisms. In biological systems, genetic information flows from DNA

(genes) to proteins, which are the molecules responsible for mediating or catalyzing biological processes.

In other words, inherited information is selectively converted into active biomolecules in response to

changing environmental conditions or demands. The molecular information pathway from gene to protein

goes through an intermediate class of molecules known as messenger RNA (mRNA). The synthesis of

mRNA is known as transcription, and the conversion of mRNA into protein is a process known as

translation. Both transcription and translation are important regulatory steps used to control which

genetic information is expressed, and when and where protein molecules will be made by the cell. The


constellation of mRNA molecules in a cell at any moment represents the expressed genome. The

expressed genome is also referred to as the transcriptome. Identifying all the genes present in the

transcriptome effectively infers the proteins being utilized by the cell (also known as the proteome) and

essentially defines the current biochemical process of the cell. While characterizing the global cellular

proteome would be most direct and informative, this is not possible using currently available technology.

Instead, genomics scientists use high-throughput DNA sequencing to characterize the genome and the

transcriptome. Genome sequencing involves determining the nucleotide sequence of extensive

chromosomal regions or in some cases a complete nucleotide sequence of the whole genome.

Characterization of the transcriptome on the other hand involves full or partial sequence characterization

of mRNA molecules. Partial sequences of mRNA molecules are known as Expressed Sequence Tags

(EST) sequences. While the process of DNA sequencing is routine, nucleotide sequences do not directly

reveal their biological meaning or function. The possible biological function of a gene sequence must be

determined either through direct empirical experimentation or, more often, through inference of gene

function using nucleotide sequence homology searches of gene databases such as GenBank [5].

2.2. Gene Homology Searches

Gene homology searches most often use the BLAST algorithm [1]. The BLAST search engine takes a

query nucleotide sequence and searches it against the database for entries matching the query. The

BLAST algorithm calculates statistical scores (bit scores and e-values) making real sequence homology

matches easier to distinguish from matches that might happen by chance. Other information included in

the BLAST result includes a short text string summarizing the biological properties of the database

match, and several unique identification numbers, the GI Number (unique ID for GenBank records) and

Accession Number, linking the matched sequence back to the GenBank database and to additional

information stored in the full database record. Each nucleotide query sequence submitted to the BLAST

search engine returns anywhere from zero (no matching homologous sequence) to hundreds of matching

database records. Results of BLAST searches are usually interpreted by reviewing the text output.

However, large-scale genomics projects often generate tens of thousands of nucleotide sequences

and the prospect of manually manipulating, summarizing, and interpreting the thousands of BLAST

output files is impractical at best. Scientists facing this informatics challenge may become discouraged or


might overlook important information because they simply cannot find it. Clearly, methods or tools are

needed to help manage the process of identifying and evaluating unknown nucleotide sequences and the

sometimes-overwhelming information obtained in large-scale nucleotide sequence homology searches.
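As a rough illustration of how the two statistics mentioned above relate, the following sketch uses the standard BLAST relation E = m · n · 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The function name and the example lengths are illustrative assumptions; real BLAST implementations apply effective-length corrections that are omitted here.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Simplified relation E = m * n * 2**(-S'), with S' the bit score,
    m the query length and n the database length (no edge corrections).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical example: a 500-base query against a database of 10**9 bases.
# Raising the bit score from 30 to 60 shrinks the e-value by a factor of 2**30.
e_low = expected_hits(30.0, query_len=500, db_len=1_000_000_000)
e_high = expected_hits(60.0, query_len=500, db_len=1_000_000_000)
```

Under this relation, each additional bit of score halves the number of matches expected by chance, which is why an e-value threshold is a convenient filter criterion.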

2.3. BlastQuest as an Answer to Tool Requirements from a Biologist’s Perspective

Genomics requires an information technology infrastructure on a scale previously unheard of and

specifically adapted to the unique data collection and analysis demands of biomedical science. The

BlastQuest system we describe demonstrates our current approach to management and visualization of

genomics information. It is by no means a complete biological data management solution, but our first

attempt to develop a prototype tool that can help us manage BLAST results through well-established

relational database principles. We are using BlastQuest to test new functionalities and evaluate the

strengths and limitations of relational databases as support tools for genomics research. Most important,

we believe BlastQuest will lead us to a new integrating data model, language, and tool for processing and

querying genomic information enabling scientists to synthesize biological insights through transparent

access to genomics information. We have more to say about these planned improvements in Section 4.

The BlastQuest Project began with several modest goals:

A BLAST results viewing tool accessible to research groups at remote locations. Users should

have access to their BLAST results from anywhere on the Web including the ability to share

results with colleagues in other locations.

Selective browsing of BLAST homology search results. As a first step, biologists want a broad

overview of the possible biological functions of the many gene sequences represented in their

DNA sequence data. The ability to reduce and summarize BLAST data to only the most

significant results is initially very informative.

Search capability on a variety of criteria, such as text terms on biological properties or gene

functions. As biological scientists identify their most interesting gene sequences they need a

way to focus and retrieve only those search results related to the precise topic of interest.

Selective data filtering on various BLAST statistical criteria such as e-value or bit score.

These statistical parameters help discriminate between real sequence homology matches and

matches that might happen by chance. There are no hard limits to the significance of these


statistical parameters. The user will choose parameters giving either a more relaxed or

restricted view as needed.

Selective data grouping on criteria such as GI number, or a defined number of top-scoring

results. For example, viewing the three statistically best-scoring results for each query sequence

is a convenient way to summarize and browse BLAST results for many query sequences.

Grouping query sequences by GI number collects all of the query sequences having sequence

homology matches with the same sequences from the database. Two or more query sequences

sharing the same database homology match imply the query sequences are related to each other

and suggest additional analysis of the relationship is warranted.

Privacy-constrained sharing of results among scientists. DNA sequence data is often

proprietary and may constitute intellectual property. Such data should not be made public until

properly protected.

A convenient interface for getting queries into and BLAST results out of the system. The

interface must be attractive and logically implemented so users will be able to find and use the

tools the system provides.

We are unaware of an existing BLAST results management system incorporating all the goals stated

above. To the best of our knowledge, the functionalities of WebBLAST 2.0 [3] and the Ontario Center for

Genomic Computing (OCGC) BLAST [2] match many of our requirements but fall short in several

important aspects. For example, there is no provision in WebBLAST for applying global filtering and

grouping operations, or a mechanism for searching all BLAST results on user-supplied text terms. The

OCGC BLAST results manager appears closest to BlastQuest in functionality, allowing selected viewing

and data filtering on up to five criteria. However, OCGC BLAST is not available to genomics scientists

outside of the Province of Ontario, Canada. The BlastQuest Project is designed to meet our immediate

specific requirements but, most importantly, to provide a platform we can freely modify to test our notions

of Genomics Algebra, an advanced query language for biological information.

3. The BlastQuest System

BlastQuest simplifies large-scale analysis in gene sequencing projects by providing scientists with a

means to filter, summarize, sort, group, and search BLAST data. BlastQuest extracts gene data from


XML files, which are returned as the result of homology searches from BLAST engines, and stores them

in an underlying relational database. This allows the user to benefit from well-known relational concepts

like transactions, controlled sharing, and query optimization.

The most frequently used user operations are hard-wired in the user interface and accessible via

command buttons. Their execution rests on SQL that is hidden from the user. To enable data analysis that

is not directly supported by the built-in user interface operations, BlastQuest offers a more flexible, mask-

oriented, and especially non-SQL query interface since biologists object to SQL due to its complexity and

low-level abstraction (see Section 4). This interface essentially allows the user to construct complex

boolean expressions as selection conditions which include logical operators and substring search

predicates. The underlying query execution is based on parameterized SQL queries, which are instantiated

and automatically translated into executable SQL code by the DBMS.

Another interesting feature of BlastQuest is that it can be linked to the so-called SMART (Simple

Modular Architecture Research Tool [6]) (see Section 3.1). The integration of BlastQuest output into

SMART for querying is in direct response to the desire by scientists for new tools and interfaces capable

of accessing and integrating external resources into one system. In Section 4, we describe our plans to

develop a Genomics Algebra query software that operates on a unifying database whose contents can

include data from existing genomics repositories. Finally, BlastQuest makes it possible to manage BLAST data on

a per-project or per-user basis using the security features of the underlying database while at the same

time allowing controlled sharing of this data in order to support collaboration.

3.1. Architectural Overview

Figure 1 depicts a conceptual overview of the 3-tiered BlastQuest system architecture. Tier 1

contains the database backend, which is implemented using an instance of the MySQL¹ RDBMS. Since

BlastQuest is mainly a proof-of-concept prototype rather than a production-strength system, our choice

for a DBMS was governed by availability of source code and platform compatibility rather than

performance and richness in features. The database backend stores and manages BLAST and PHRAP

(Phragment Assembly Program) [4] results, which are represented as XML and ACE² (ArChivE)

¹ See http://www.mysql.com/.

² See http://bozeman.mbt.washington.edu/phrap.docs/phrap.html for an example and documentation on the format.


documents and whose structure has been mapped into the relations Hit, NoHit, and Assemble shown

in Figure 2.

Figure 1: Conceptual overview of the BlastQuest system architecture.

For each gene sequence that produced a match during the BLAST search, the relation Hit stores

the XML file name where the original query sequence can be found as well as detailed hit information,

such as hit definition, expect value, bit score and so forth. The relation NoHit stores information about

those sequences that have no database match by the homology search criteria. From a biological point

of view, sequences with no homologous sequence match often lead to new genes and are analyzed in a

different manner (outside of BlastQuest). In addition, the database also stores information about how

related gene segments are assembled into single consensus DNA sequences by PHRAP, which is external

to BlastQuest and invoked before the DNA sequence results are submitted to BLAST. PHRAP outputs its

results in an ACE file, which is mapped into the relation called Assemble. Querying the Assemble

relation with a specific consensus sequence name, one can retrieve all segments that are clustered into the

query consensus sequence.


Figure 2: Relational Schema of the BlastQuest database.

The database also maintains information about users and their corresponding gene sequencing

projects, which are stored in the three remaining relations, User, Project, and UserProj. The

relation UserProj represents the many-many relationship between scientists and the projects to which

they belong. Since all sequence data is organized by project (using the PID foreign key in each of the

relations Hit, NoHit, and Assemble), BlastQuest provides control over who has access to which data.
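To make the project-based organization more concrete, the following sketch sets up a reduced version of the six relations and shows how the PID foreign key can restrict a user to the hits of their own projects. It is an illustration only: it uses Python's built-in sqlite3 module instead of MySQL, and all column names, sample rows, and helper functions are assumptions, since the actual attributes appear only in Figure 2.

```python
import sqlite3

# Hypothetical, reduced schema; the real attributes are those of Figure 2.
SCHEMA = """
CREATE TABLE Project  (pid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE User     (uid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE UserProj (uid INTEGER REFERENCES User(uid),
                       pid INTEGER REFERENCES Project(pid),
                       PRIMARY KEY (uid, pid));
CREATE TABLE Hit      (pid INTEGER REFERENCES Project(pid),
                       query_name TEXT, xml_file TEXT, gi INTEGER,
                       hit_def TEXT, expect REAL, bit_score REAL);
CREATE TABLE NoHit    (pid INTEGER REFERENCES Project(pid),
                       query_name TEXT, xml_file TEXT);
CREATE TABLE Assemble (pid INTEGER REFERENCES Project(pid),
                       consensus TEXT, segment TEXT);
"""


def hits_visible_to(conn, uid):
    """Return (query_name, hit_def) for hits in projects the user belongs to."""
    return conn.execute(
        "SELECT h.query_name, h.hit_def FROM Hit AS h "
        "JOIN UserProj AS up ON up.pid = h.pid "
        "WHERE up.uid = ? ORDER BY h.query_name",
        (uid,),
    ).fetchall()


def segments_of(conn, pid, consensus):
    """Return all segments clustered into one consensus sequence."""
    rows = conn.execute(
        "SELECT segment FROM Assemble "
        "WHERE pid = ? AND consensus = ? ORDER BY segment",
        (pid, consensus),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO Project VALUES (?, ?)",
                 [(1, "project_a"), (2, "project_b")])
conn.executemany("INSERT INTO User VALUES (?, ?)", [(1, "alice"), (2, "bob")])
# alice belongs to project 1 only; bob belongs to both projects.
conn.executemany("INSERT INTO UserProj VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])
conn.executemany("INSERT INTO Hit VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "contig_0001", "run1.xml", 12345, "reverse transcriptase", 3e-55, 210.5),
    (2, "contig_0900", "run7.xml", 67890, "actin", 1e-80, 300.0),
])
conn.executemany("INSERT INTO Assemble VALUES (?, ?, ?)", [
    (1, "contig_0001", "est_a"), (1, "contig_0001", "est_b"),
    (1, "contig_0002", "est_c"),
])
```

With this data, `hits_visible_to(conn, 1)` returns only the hit from project 1, whereas user 2 sees hits from both projects; `segments_of(conn, 1, "contig_0001")` lists the two segments assembled into that consensus sequence.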

Tier 2 contains the multi-threaded BlastQuest application program, which is divided into four

modules: The client interface module, which handles communication with the Web clients in tier 3, the

two loader modules for extracting and loading data from the XML and ACE input files into the database,

and the SQL constructor for assembling the queries and record sets to be sent to the database. The client

interface module is implemented as a series of JavaServer Pages (JSPs) that execute inside a Tomcat

server. The remaining three modules are implemented as Java classes.

The XML loader parses each BLAST result file into a Document Object Model (DOM)

representation using the Xerces Java Parser 1.4.4. The XML loader then extracts the relevant data items

needed to populate the Hit and NoHit tables. Specifically, the loader module contains two classes

whose structures correspond to the Hit and NoHit tables in the database schema. When the loader

collects data from an XML file, it populates the appropriate class objects with the extracted values. At the

end, the objects are passed to the SQL manager, which creates the SQL commands to insert the values

into the relational database. The ACE loader works in a similar fashion. However, since there was no

standard ACE parser available, we created our own. Our event-based parser detects the presence of


certain keywords in the ACE input file and extracts the information associated with that keyword. It is

important to note that other, more efficient loading options are possible, for example by using the bulk

loading utilities of the DBMS. However, by making our loader modules part of the Web-based

middleware, users can load BLAST results into their BlastQuest accounts from anywhere on the Web as

long as they have access to a Web browser.
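The loading step described above can be sketched as follows. This is not the BlastQuest loader, which is written in Java on top of Xerces; it is a minimal Python analogue using the standard xml.etree.ElementTree module. The element names follow the NCBI BLAST XML output format, while the embedded sample document, the two row classes, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Abbreviated, hand-written sample in the style of NCBI BLAST XML output.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig_0001</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|12345|gb|AB000001.1|</Hit_id>
          <Hit_def>reverse transcriptase</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>210.5</Hsp_bit-score>
              <Hsp_evalue>3e-55</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>contig_0002</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


@dataclass
class HitRow:  # hypothetical counterpart of the Hit relation
    query_name: str
    xml_file: str
    gi: Optional[int]
    hit_def: str
    expect: float
    bit_score: float


@dataclass
class NoHitRow:  # hypothetical counterpart of the NoHit relation
    query_name: str
    xml_file: str


def parse_gi(hit_id: str) -> Optional[int]:
    """Extract the GI number from an identifier such as 'gi|12345|gb|...'."""
    parts = hit_id.split("|")
    if len(parts) > 1 and parts[0] == "gi" and parts[1].isdigit():
        return int(parts[1])
    return None


def load_blast_xml(xml_text: str, xml_file: str):
    """Split one BLAST result document into Hit rows and NoHit rows."""
    hits, no_hits = [], []
    for iteration in ET.fromstring(xml_text).iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        found = iteration.findall("Iteration_hits/Hit")
        if not found:
            no_hits.append(NoHitRow(query, xml_file))
            continue
        for hit in found:
            hsp = hit.find("Hit_hsps/Hsp")  # first (best) HSP of this hit
            if hsp is None:
                continue
            hits.append(HitRow(
                query, xml_file,
                parse_gi(hit.findtext("Hit_id", default="")),
                hit.findtext("Hit_def", default=""),
                float(hsp.findtext("Hsp_evalue")),
                float(hsp.findtext("Hsp_bit-score")),
            ))
    return hits, no_hits


hits, no_hits = load_blast_xml(SAMPLE, "run1.xml")
```

The resulting objects play the role of the class instances that the loader hands to the SQL manager for insertion.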

The SQL manager module is the gateway between the database (via the JDBC driver) and the

middleware. In addition to creating the SQL load commands, it translates commands from the user

interface into SQL queries, which can be executed by the DBMS. Analogously, it processes the resulting

record sets and creates the Java objects that are used by the client interface to generate the Web pages.

Tier 3 is a (thin) client interface, which is implemented as dynamic Web pages displayed inside a

Web browser. Client-side processing is limited to validation of user input, submitting requests to the

BlastQuest application and displaying HTML results.

3.2. Sample BlastQuest Session

A sample data analysis session illustrates some of the main features of BlastQuest. A page (not shown) of

the Web-browser component in BlastQuest facilitates the extraction of gene data from original, external

BLAST files into a MySQL database. Due to the large volume of data, a simple page-by-page viewing is

not helpful to the user but selection mechanisms are needed to find the data of interest. The overall user

interface strategy is to apply a sequence of consecutive operations on the data to home in gradually on the

data of interest. In the following we describe the main user interface features for doing this.

The first feature is to let BlastQuest create a summary page for selected sequence segments. For

each query DNA sequence, only the sequence database match with the best statistical score calculated by

BLAST is displayed with a summary of important biological information, usually text terms describing a

gene or protein name, and sometimes including possible biological functions. The summary page also

contains, for each matching sequence, the GenBank sequence ID, gene definition, and expect value.

The second feature is user-controlled selection. Unfortunately, the statistically calculated ranking

of matching sequences provided by BLAST does not necessarily correspond to the biological knowledge

and experience of the user. The user may apply their biological knowledge or insight to tag a different

result as better for expressing the possible function of the query sequence. By manually selecting a


specific query result, the user can get additional information such as the percentage of identity, or

alignment of the query sequence and the matching sequence. Even a detailed display of sequence

alignments is available, which is identical to the free-text formatted BLAST result to which most BLAST

users are accustomed.

The third feature is related to built-in selection facilities, which can be activated by a mouse-click

and operate on all query sequences and their query results. Examples are the display of hits with

expect values less than a particular threshold by selecting from a pull-down menu (Figure 3), or

restricting the display to the best n database matches for each query sequence. All filtering facilities

together give researchers the ability to adjust their analysis process to the particular research focus,

project status, and prior knowledge of query sequences, to reduce the original BLAST result to a

manageable size, and especially to remove results of low quality.
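The two built-in filters mentioned above, an expect-value threshold and a restriction to the best n matches per query sequence, can be expressed in a single SQL query. The sketch below is one possible formulation and not necessarily the query BlastQuest generates; it runs on Python's sqlite3 module, and the table layout and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hit (query_name TEXT, gi INTEGER, expect REAL)")
conn.executemany("INSERT INTO Hit VALUES (?, ?, ?)", [
    ("q1", 101, 1e-50), ("q1", 102, 1e-20), ("q1", 103, 1e-5), ("q1", 104, 0.5),
    ("q2", 201, 1e-3), ("q2", 202, 2.0),
])

# A hit is kept if it passes the threshold and fewer than n hits of the same
# query have a strictly smaller expect value (ties may yield more than n rows).
TOP_N_SQL = """
SELECT h.query_name, h.gi, h.expect
FROM Hit AS h
WHERE h.expect < :max_expect
  AND (SELECT COUNT(*) FROM Hit AS better
       WHERE better.query_name = h.query_name
         AND better.expect < h.expect) < :n
ORDER BY h.query_name, h.expect
"""


def best_hits(conn, max_expect, n):
    """Best n hits per query sequence with expect value below max_expect."""
    return conn.execute(TOP_N_SQL, {"max_expect": max_expect, "n": n}).fetchall()


rows = best_hits(conn, max_expect=0.05, n=2)
```

For the sample data, the query keeps GI numbers 101 and 102 for q1 and GI number 201 for q2; the remaining hits are dropped either by the threshold or by the top-n restriction.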

Figure 3: User-defined query construction tool.

The fourth feature comprises ordering and grouping functions. These help the user to discover

relationships among genes or expression patterns. For example, if the user asks for grouping on GI

number or query sequence, related sequences and their BLAST results are grouped together rather than


appearing randomly or out of context. This is also a proven method to identify EST sequences that come

from different regions of the same mRNA, gene orthologs, or gene paralogs³.
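The idea behind grouping on GI number can be sketched in a few lines: query sequences that share a database match are collected under that match. The function and the sample identifiers below are hypothetical and only illustrate the grouping logic, not the BlastQuest implementation.

```python
from collections import defaultdict


def queries_sharing_a_match(rows):
    """Map each GI number to the query sequences that hit it.

    `rows` is an iterable of (query_name, gi) pairs; only GI numbers matched
    by two or more distinct query sequences are returned.
    """
    by_gi = defaultdict(set)
    for query_name, gi in rows:
        by_gi[gi].add(query_name)
    return {gi: sorted(names) for gi, names in by_gi.items() if len(names) > 1}


# Hypothetical hits: est_01 and est_02 both match GI 555.
related = queries_sharing_a_match([
    ("est_01", 555), ("est_02", 555), ("est_03", 777), ("est_01", 888),
])
```

Here only GI 555 is reported, with est_01 and est_02 as candidates for coming from the same mRNA or from related genes.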

The fifth feature enables user-defined, mask-oriented, non-SQL queries. This feature refers to the

problem that the built-in functionality of BlastQuest is sometimes insufficient for specific analysis tasks.

BlastQuest provides a special Web page which allows the user to click on particular buttons, to manually

insert text, and in this way to interactively and textually construct complex boolean filter expressions

which may include logical operators like “AND” and “OR” as well as substring search predicates like

“Contains” or “Not Contains” (Figure 3). A search field (like “Hit Definition” in our example) to which

the Boolean expression is compared can be selected by a drop-down menu. Figure 3 shows two textual

representations of the same Boolean expression under construction. The second representation expresses

the condition in a way nearer to natural language. The first representation is a test mode translating the

‘natural language’ condition into SQL. In a later version the SQL test mode will disappear. The

construction of the Boolean expression and hence of the query is completed by clicking the “Commit”

button. BlastQuest assembles the SQL query, sends it to the MySQL driver, receives the results and

displays them. In the example in Figure 3 the user is just specifying a query which focuses on matches

that contain the word ‘reverse’, but not ‘hypothetical’.
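The translation from the mask-oriented expression to SQL might look roughly like the following sketch. It is an assumption about how such a translation could be done, not the BlastQuest code: the field mapping, the function name, and the table layout are invented for illustration, and Python's sqlite3 module stands in for MySQL. User text is passed as query parameters rather than being pasted into the SQL string.

```python
import sqlite3

# Hypothetical mapping from menu labels to column names and SQL operators.
FIELDS = {"Hit Definition": "hit_def", "Query Name": "query_name"}
PREDICATES = {"Contains": "LIKE", "Not Contains": "NOT LIKE"}
CONNECTIVES = {"AND", "OR"}


def build_where(field_label, terms):
    """Turn mask input into a parameterized WHERE clause.

    `terms` alternates (predicate, text) pairs and connectives, e.g.
    [("Contains", "reverse"), "AND", ("Not Contains", "hypothetical")].
    The sketch does not check the alternation or escape % and _ in the text.
    """
    column = FIELDS[field_label]
    parts, params = [], []
    for item in terms:
        if isinstance(item, str):
            if item not in CONNECTIVES:
                raise ValueError(f"unknown connective: {item}")
            parts.append(item)
        else:
            predicate, text = item
            parts.append(f"{column} {PREDICATES[predicate]} ?")
            params.append(f"%{text}%")
    return " ".join(parts), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hit (query_name TEXT, hit_def TEXT)")
conn.executemany("INSERT INTO Hit VALUES (?, ?)", [
    ("q1", "reverse transcriptase"),
    ("q2", "hypothetical reverse gyrase"),
    ("q3", "actin"),
])

where, params = build_where(
    "Hit Definition",
    [("Contains", "reverse"), "AND", ("Not Contains", "hypothetical")],
)
matches = [row[0] for row in conn.execute(
    f"SELECT hit_def FROM Hit WHERE {where} ORDER BY hit_def", params)]
```

Only column names and operators taken from the fixed dictionaries end up in the SQL text, so the user never writes SQL directly, which matches the intent of a non-SQL interface.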

The sixth and last main feature to be mentioned is interoperability between BlastQuest and other

biological information systems. Creating links to other systems in order to make use of their specific

functionality becomes more and more important for the biologist. In the context of BlastQuest, after

having examined the query sequences and their probable identities, we wish to derive the protein

sequences encoded by the nucleotide sequence. Rather than translate the nucleotide sequence directly,

BlastQuest takes the ‘best’ match, which represents a homologous gene closely related to the unknown

query sequence, and retrieves the corresponding protein sequence as translated by BLAST. After

grouping search results by query sequence (e.g., the best five statistical matches) the user is presented

with the screen shown in the top half of Figure 4. Next, the user checks the ‘amino conversion’ box at the

right top of the screen, and the check box adjacent to the query sequence they wish to translate into an

amino acid sequence. When the user clicks the ‘Details’ button, the ‘Sequence Analysis’ screen shown in

the bottom half of Figure 4 appears. The user may submit the derived protein sequence to the SMART

³ Gene orthologs are genes that are derived by divergent evolution, such as the α-hemoglobin gene from human and from mouse. Gene paralogs are genes that are duplications, such as α-hemoglobin and β-hemoglobin.


protein analysis Web site by simply clicking on the amino acid sequence. Results of the SMART analysis

will appear in the browser window.

Figure 4: Filtering and grouping BLAST results on a project basis.

All described operations can be combined to analyze data generated in a project. For example, the

user may ask BlastQuest to retrieve hits with expect value lower than 0.05, followed by grouping on gene

ID, and only display the top five matching hits per GI number. The screen snapshot in Figure 4 shows this

result.

4. Evaluation and Planned Improvements

The BlastQuest system described above has been used successfully by scientists in a gene-sequencing lab

at a university for over six months, and the feedback from users has been positive. However, we also

received important feedback regarding the limitations of the current system. For example, there is a desire

for additional, more sophisticated analysis functionality, the ability to integrate data from external

repositories, etc. As a starting point for the development of a more sophisticated management system for


genomics data, we have identified all of the biological needs that are currently not supported in

BlastQuest⁴. In the interest of space, we provide the readers with an overview of the most important ones:

1. The ability to query, search and analyze data from external genomics repositories (in addition

to those accessible through BLAST). An extension of this is the ability to integrate related

results from multiple repositories in a meaningful manner, for example, to fill in missing values

or correct inconsistencies that exist across different repositories.

2. A representation of the genomics data that is semantically richer than the current textual

representation provided by BLAST and most other repositories. For example, BLAST query

results are more or less collections of textual strings and numerical values and are not expressed

in biological terms such as genes, proteins, and nucleotide sequences. As a result, BLAST and

BlastQuest operations are limited to basic string manipulation (e.g., shortest common substring)

rather than high-level, gene-specific operations such as transcribe, translate, etc.

3. Integration of new specialty evaluation functions. The possibility to evaluate data from BLAST

results as well as self-generated data with publicly available methods is insufficient. Thus, it

must be possible to create, use, and integrate user-defined functions that are capable of

operating on both kinds of data into the analysis interface of the tool. However, this requires an

extensible database management system, query language, and user interface, which is currently

not part of BlastQuest.

4. The ability to create and store new knowledge. A biologist generates new biological data from

their own research or experimental work, for example, by analyzing BLAST results. Hence,

scientists have expressed a strong desire to store and manage this newly created knowledge

together with the source data. For example, there is a need to annotate data in BLAST results

and to store the annotations persistently so that they can be re-used (e.g., by linking a record in

a new BLAST result to an existing annotation in the repository).

5. Support for controlled collaboration among multiple scientists. It is of great value for scientists

to share some of their findings in a controlled manner with colleagues. For example, users of a

genomics repository system should be able to grant write access to some of their annotations

but read-only or no access to others.

4 In fact, based on a survey of the related literature, we have found that most of the existing integration and management systems for genomics data, such as K2/KLEISLI (http://db.cis.upenn.edu/K2/), Tambis (http://imgproj.cs.man.ac.uk/tambis/index.html), SRS (http://srs.ebi.ac.uk), etc., support only some of the functionality described in this list.

6. The ability to connect DNA sequence identities inferred from BLAST results with gene-

associated biological functions described through the efforts of the Gene Ontology (GO)

Consortium [7]. This type of cross-referencing is the best way to describe the functionality of a

newly discovered gene. This functionality will help biologists to annotate and catalog the

genes by universally accepted GO IDs and hence help them to discover new genes.
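Needs 4 and 5 above (persistent annotations and controlled sharing) can be illustrated with a minimal Python sketch. It is not part of BlastQuest: the class names `Annotation` and `Access`, the field `hit_id`, and the user names are hypothetical, and the annotation is kept in memory here, whereas a real system would store it persistently in the database.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    """Hypothetical access levels for a shared annotation."""
    NONE = 0
    READ = 1
    WRITE = 2


@dataclass
class Annotation:
    hit_id: str   # hypothetical key of the annotated BLAST hit
    owner: str
    text: str
    grants: dict = field(default_factory=dict)  # user name -> Access

    def access_for(self, user: str) -> Access:
        # The owner can always write; everyone else needs an explicit grant.
        if user == self.owner:
            return Access.WRITE
        return self.grants.get(user, Access.NONE)

    def read(self, user: str) -> str:
        if self.access_for(user) is Access.NONE:
            raise PermissionError(f"{user} may not read this annotation")
        return self.text

    def update(self, user: str, new_text: str) -> None:
        if self.access_for(user) is not Access.WRITE:
            raise PermissionError(f"{user} may not modify this annotation")
        self.text = new_text


# Illustrative data only.
note = Annotation(hit_id="hit-001", owner="alice", text="putative kinase")
note.grants["bob"] = Access.READ      # read-only colleague
note.grants["carol"] = Access.WRITE   # colleague with write access

print(note.read("bob"))               # putative kinase
note.update("carol", "confirmed kinase")
print(note.read("alice"))             # confirmed kinase
```

A user without any grant gets `PermissionError` on both `read` and `update`, which corresponds to the "no access" case in need 5.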

Based on this list, which illustrates the complexity of the information-related challenges that confront

biologists and computer scientists, we decided to redesign our current system from the ground up. For

example, providing users with a semantically rich representation of the genomics data as well as support

for specialty functions (needs 2 and 3 above) requires the design of a new data type system and

operations, which must be integrated with the underlying database management system for efficient query

processing and persistence. As another example, access to multiple genomics repositories (need 1) requires

the ability to extract, translate, and reconcile heterogeneous data from multiple sources and store the

integrated result using a global schema, which has been constructed either from the local schemas of the

sources or based on general knowledge of the domain.
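The extract, translate, and reconcile steps for need 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the repository names, the field names, the mappings in `SCHEMA_MAPPINGS`, and the sample records are all made up, and a real integration layer would need far richer conflict handling.

```python
# Hypothetical mappings from each source's local field names to a global schema.
SCHEMA_MAPPINGS = {
    "repo_a": {"acc": "accession", "org": "organism", "len": "length"},
    "repo_b": {"id": "accession", "species": "organism", "seq_len": "length"},
}


def to_global(source, record):
    """Translate one local record into the global schema, dropping empty values."""
    mapping = SCHEMA_MAPPINGS[source]
    return {
        mapping[key]: value
        for key, value in record.items()
        if key in mapping and value not in (None, "")
    }


def reconcile(local_records):
    """Merge records from several sources that describe the same entity.

    Returns (merged, conflicts): `merged` holds every field on which all
    sources that supply a value agree, so a value missing in one source is
    filled in from another; `conflicts` maps each disputed field to the
    value reported by each source.
    """
    translated = {src: to_global(src, rec) for src, rec in local_records.items()}
    merged, conflicts = {}, {}
    for name in sorted({f for rec in translated.values() for f in rec}):
        values = {src: rec[name] for src, rec in translated.items() if name in rec}
        if len(set(values.values())) == 1:
            merged[name] = next(iter(values.values()))
        else:
            conflicts[name] = values
    return merged, conflicts


# Illustrative records: repo_a lacks the sequence length, repo_b supplies it.
merged, conflicts = reconcile({
    "repo_a": {"acc": "EX0001", "org": "Zea mays", "len": None},
    "repo_b": {"id": "EX0001", "species": "Zea mays", "seq_len": 1200},
})
print(merged)     # {'accession': 'EX0001', 'length': 1200, 'organism': 'Zea mays'}
print(conflicts)  # {}
```

Disputed fields are reported rather than silently resolved, so that a curator or a source-priority rule can decide which value enters the integrated result.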

In response to our requirements analysis, we are developing a new genomics integration and

management system that is based on two fundamental pillars: (1) A Genomics Algebra software system to

provide an extensible set of high-level genomic data types (GDTs) (e.g., genome, gene, chromosome,

protein, nucleotide) together with a comprehensive collection of appropriate genomic functions (e.g.,

translate, transcribe, decode). (2) A Unifying Database, which allows us to manage the semi-structured

or, ideally, structured contents of publicly available genomic repositories and to transfer these data into

GDT values. These values then serve as arguments of Genomics Algebra operations, which can be

embedded into a DBMS query language.
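To make the idea of GDTs and genomic functions more concrete, here is a minimal Python sketch. It does not reproduce the Genomics Algebra itself, whose types and signatures are not specified here: the classes `Gene`, `MRNA`, and `Protein`, the simplified behavior of `transcribe` and `translate`, and the deliberately partial codon table are assumptions made for illustration.

```python
from dataclasses import dataclass


# Hypothetical genomic data types (GDTs); names and fields are illustrative.
@dataclass(frozen=True)
class Gene:
    sequence: str   # coding strand as DNA, 5' to 3'


@dataclass(frozen=True)
class MRNA:
    sequence: str


@dataclass(frozen=True)
class Protein:
    residues: str   # one-letter amino acid codes


# Partial codon table from the standard genetic code, just enough for the example.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(gene: Gene) -> MRNA:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return MRNA(gene.sequence.upper().replace("T", "U"))


def translate(mrna: MRNA) -> Protein:
    """Translate from the first AUG up to the first stop codon."""
    seq = mrna.sequence
    start = seq.find("AUG")
    if start < 0:
        return Protein("")
    residues = []
    for i in range(start, len(seq) - 2, 3):
        amino = CODON_TABLE[seq[i:i + 3]]  # KeyError for codons outside the partial table
        if amino == "*":
            break
        residues.append(amino)
    return Protein("".join(residues))


gene = Gene("ATGTTTGGCGCTAAATAA")
protein = translate(transcribe(gene))
print(protein.residues)  # MFGAK
```

Because the operations are typed, `translate(transcribe(gene))` composes in the way an algebra suggests, while passing a `Gene` directly to `translate` would be flagged by a type checker. Embedded in a DBMS query language, such functions would be applied to GDT values stored in table columns rather than to raw strings.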

We believe our new approach will cause a fundamental change in the way biologists analyze

genomic data. No longer will biologists be forced to interact with hundreds of independent data

repositories, each with its own interface. Instead, biologists will work with a unified database through a

single user interface specifically designed for biologists. Our high-level Genomics Algebra will allow

biologists to pose questions using biological terms, not SQL statements. Managing user data will also

become much simpler for biologists, since their data can also be stored in the Unifying Database and they

will no longer have to prepare a custom database for each data collection. Biologists should, and indeed

want to, invest their time being biologists, not computer scientists.

From a computer science perspective, our project leverages and extends the benefits and

possibilities of current database technology. In particular, we demonstrate the elegance and expressive

power of modeling non-standard and extremely complex data as abstract data types and of integrating

these types into databases and query languages. In addition, our approach is independent of a specific

underlying DBMS data model. That is, the Genomics Algebra can be embedded in a relational, object-

relational, or object-oriented DBMS as long as it is equipped with the appropriate extensibility

mechanisms. Finally, we believe we will gain valuable knowledge about the design and

implementation of new, sophisticated data structures and efficient algorithms in the non-standard

application field of biology and bioinformatics.

5. Conclusion

In this paper we have described BlastQuest, a Web-based and interactive tool for importing and

persistently storing genomic data from multiple BLAST queries in a relational database, applying DBMS

functionality for processing and querying these data, and visualizing them appropriately. The underlying

concept has limitations that will inevitably be reached even after meaningful improvements, and

overcoming them requires new concepts and advanced tools. The Genomics Algebra briefly sketched in the

preceding section is a promising approach in this direction.

References
