Databases for Bioinformatics and Genomics Jonathan Crabtree [email protected] Center for Bioinformatics University of Pennsylvania


OUTLINE

• Introduction to relational databases

• Some problems specific to bioinformatics

• Survey of bioinformatics databases and resources on the web

Part I: An Introduction to Relational Databases

OR:

"Just what is the difference between Larry Ellison and God?"

What is a Database?

• Collection of data, often organized for fast retrieval

• Example: BLAST database (FASTA file)

>gi|14728383|ref|XM_003323.4| Homo sapiens clock (mouse) homolog (CLOCK), mRNA
GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTAAGATGCCTTTGGATAATTCTACAGTCCTCTTAAATGAATCTTTAGAACTTGGCAAGTCTCACTAGATACCTTCAATCATCATTTTGAGCTCAAAGAATTCTGAGACTTATGGTTGGTCATATAGAAGAGTACCTTGAACCTATAGTTTCC..
>gi|5453027|gb|AF071947.1|MMTECH13 Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and partial cds
TTCTCGTCCCGTCTTCGTCTCTCGTCTCGTCGTCGTCGTCGTCTCTCTGCCTCTCCGTCTCGTCGGATCTCGTCGTCTCTCGTCTCGTCTTTCTCTCGTCTGTCAAGGTTTCTCTATGTAACCCTGGCTGTCCTGGAACTCTATGTAGACCAGGCTGGCCTTGAACTCACTCGCTTCTGCCTCCCAGAGTGCTGGGATTAAAGACTTTTGCCACCACACTTAGCTTCAGCCAGTTTTTAAAACAGATCTTTAGATTCTCCCCTAATCCCAAACCAAGTCT..
>gi|1161348|gb|U34367.1|HSTECTXT01 Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, exon 1
TCTAGATGCTTTATACCACTTCCTCTGGAGCAGCTCTGCTTTAGATTTTTTACATATNGGGCCTTTGGGTAAAATTTTATTTAAAGAACAAAATTGCTTACAAAGAAAAGCTTGCAAACTATTTTATGATACCTTGTCTCTACTGCCTTATAAAAAGAAAGAAAAACAGAAAAATGAAAGTGCAGATGGTAACGTGGTGATGGATCCATCTCATGCACTTAAATTAAGCTGAGAGGAAGTTGGAAATCACCAGGTGTGGCCACTGAGGTGAACATTTAGC..

BLAST Database Limitations

• Contains little structured data (other than the sequence itself)

• Not guaranteed to have any unique identifier

• Defline is often structured by convention only (as in our example)

• Designed to answer only one class of queries: "What do you have that's similar to this sequence?"

ID   HSECTXT01  standard; DNA; HUM; 5579 BP.
XX
AC   U34367;
XX
SV   U34367.1
XX
DT   24-JAN-1996 (Rel. 46, Created)
DT   02-JUL-1999 (Rel. 60, Last updated, Version 7)
XX
DE   Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine
DE   kinase TXK (txk) gene, exon 1.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RX   MEDLINE; 96197775.
RA   Ohta Y., Haire R.N., Amemiya C.T., Litman R.T., Trager T., Riess O.,
RA   Litman G.W.;
RT   "Human Txk: genomic organization, structure and contiguous physical linkage
RT   with the Tec gene";
RL   Oncogene 12(4):937-942(1996).
XX
RN   [2]
RP   1-5579
RA   Litman G.W.;
RT   ;
RL   Submitted (18-AUG-1995) to the EMBL/GenBank/DDBJ databases.
RL   Gary W. Litman, All Childrens Hospital, 801 Sixth Street South, St
RL   Petersburg, FL 33701, USA

Compare with FASTA/BLAST format:

>gi|1161348|gb|U34367.1|HSTECTXT01 Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, exon 1

EMBL format

XX
DR   SPTREMBL; Q14219; Q14219.
DR   SWISS-PROT; P42681; TXK_HUMAN.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5579
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT                   /cell_type="leukocyte"
FT                   /tissue_type="blood"
FT                   /dev_stage="adult"
FT   intron          <1..407
FT                   /gene="tec"
FT   mRNA            join(<408..526,608..765,1931..3624)
FT                   /gene="tec"
FT                   /product="protein tyrosine kinase"
FT   CDS             join(<408..526,608..765,1931..2014)
FT                   /codon_start=2
FT                   /db_xref="SPTREMBL:Q14219"
FT                   /gene="tec"
FT                   /product="protein tyrosine kinase"
FT                   /protein_id="AAB60411.1"
FT                   /translation="YVLDDQYTSSSGAKFPVKWCPPEVFNYSRFSSKSDVWSFGVLMWE
FT                   VFTEGRMPFEKYTNYEVVTMVTRGHRLYQPKLASNYVYEVMLRCWQEKPEGRPSFEDLL
FT                   RTIDELVECEETFGR"
FT   exon            5166..5267
FT                   /number=1
FT                   /gene="txk"
FT                   /product="tyrosine kinase"
XX
SQ   Sequence 5579 BP; 1512 A; 1112 C; 1275 G; 1679 T; 1 other;
     tctagatgct ttataccact tcctctggag cagctctgct ttagattttt tacatatngg        60
     gcctttgggt aaaattttat ttaaagaaca aaattgctta caaagaaaag cttgcaaact       120
     attttatgat accttgtctc tactgcctta taaaaagaaa gaaaaacaga aaaatgaaag       180
     tgcagatggt aacgtggtga tggatccatc tcatgcactt aaattaagct gagaggaagt       240
     tggaaatcac caggtgtggc cactgaggtg aacatttagc ttcaaccgga tgcttgttgg       300
     . . .

EMBL Format

• Considered a "flat-file" format (i.e. plain text)

• But it allows for additional explicit structure
  - Each entry has multiple named "fields"
  - The meanings of these fields are predefined

• Entries have a unique identifier or "accession"
  - "Primary key" in relational database lingo
  - And its presence leads to an obvious query: "Tell me everything you know about the entry with ID X"
  - Can be answered quickly by indexing the file

Indexed Flat File

EMBL-FORMAT FLAT FILE (N lines)

   1 ID HSECTXT01 standard; DNA; HUM; 5579 BP.
   2 XX
   3 AC U34367;
   4 XX
   5 SV U34367.1
   6 XX
   7 DT 24-JAN-1996 (Rel. 46, Created)
   8 DT 02-JUL-1999 (Rel. 60, Last updated, Version 7)
   9 XX
  10 DE Human protein tyrosine kinase TEC (tec) gene, partial cds, and
  11 DE tyrosine kinase TXK (txk) gene, exon 1.
  12 XX
  13 KW .
  14 XX
   .
   .
4073 ID AF071947 standard; DNA; ROD; 1381 BP.
4074 XX
4075 AC AF071947;
4076 XX
4077 SV AF071947.1
4078 XX
4079 DT 13-JUL-1999 (Rel. 60, Created)
4080 DT 13-JUL-1999 (Rel. 60, Last updated, Version 1)
4081 XX
4082 DE Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and
4083 DE partial cds
   ...

INDEX FILE (M entries)

U34367.1     1
AF071947.1   4073
...

Find entry with a particular ID:

Naïve approach: O(N)

Using an index: O(M)

Can be improved: O(log M) with a sorted index (binary search), or O(C), i.e. constant time, with a hash table
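The indexing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_index` and `fetch_entry` and the tiny two-entry file are made up for the example, and the index maps the versioned accession (SV line) to a starting line number, as on the slide. Real indexers would more likely store byte offsets.

```python
def build_index(lines):
    """Map each entry's versioned accession (SV line) to the line number of its ID line."""
    index = {}
    entry_start = None
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("ID "):
            entry_start = lineno
        elif line.startswith("SV ") and entry_start is not None:
            index[line.split()[1]] = entry_start
    return index


def fetch_entry(lines, index, accession):
    """Return the lines of one entry; the dict lookup takes constant time on average."""
    start = index[accession]
    entry = []
    for line in lines[start - 1:]:
        entry.append(line)
        if line.startswith("//"):  # EMBL entries end with a "//" line
            break
    return entry


# Hypothetical, heavily shortened flat file with two entries.
flat_file = [
    "ID   HSECTXT01  standard; DNA; HUM; 5579 BP.",
    "XX",
    "AC   U34367;",
    "XX",
    "SV   U34367.1",
    "//",
    "ID   AF071947   standard; DNA; ROD; 1381 BP.",
    "XX",
    "AC   AF071947;",
    "XX",
    "SV   AF071947.1",
    "//",
]

index = build_index(flat_file)
mouse_entry = fetch_entry(flat_file, index, "AF071947.1")
```

Building the index costs one O(N) pass over the file; after that each lookup avoids rescanning it.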

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, and translated products" ,
    source {
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag id 9606 } } ,
        orgname {
          name binomial {
            genus "Homo" ,
            species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,
          gcode 1 ,
          mgcode 2 ,
          div "PRI" } } ,
..
        from journal {
          title {
            iso-jta "Oncogene" } ,
          imp {
            date std {
              year 1996 ,
              month 2 ,
              day 15 } ,
            volume "12" ,
            issue "4" ,
            pages "937-942" } } } } ,
..

Compare with EMBL format: RL Oncogene 12(4):937-942(1996).

ASN.1 (GenBank)

ASN.1

• Unlike EMBL format, it allows one to give a formal description of what data is valid
  - So its validity can be checked automatically
  - Same idea as DTDs for XML, which is quite similar

• But, as with everything we've seen so far, ASN.1 data is typically stored in plain text files

• Used extensively for data management at NCBI (e.g. GenBank, PubMed, Entrez)

Data Model vs. Storage Format

[Figure: two independent axes.
Increasing sophistication of physical storage mechanism: flat file, indexed flat file, RDBMS, OODBMS.
Increasing sophistication of data model: FASTA format (free text), EMBL format (tag, value pairs), XML/ASN.1/object-oriented (arbitrary nesting of data).]

• Can store FASTA files in a relational database

• Or highly-structured datasets in a flat file

What's Wrong with Flat Files?

• Only one person can edit them at any given time

• Edits must be checked for validity

• Can become corrupted if the computer crashes

• Don't support queries without additional data structures and/or search tools (e.g., BLAST)

• Implementing access controls will likely require splitting the file or writing additional software

• Many systems have fairly low file size limits

• Database guys will laugh at you behind your back

Relational (and other) databases:

• Solve many of these problems in a generic way

• Provide a reasonably sophisticated data model

• Enable fast queries over data stored in that model

• Ensure the integrity & consistency of the data

Components of a Database

• Data model - what the data looks like
  - How is the database structured?
  - Note that a database can have structure that is not explicitly identified by the data model.

• Access method(s) - how it can be accessed
  - Usually based on the data model
  - Could be an external tool like BLAST
  - Or a general-purpose "query language"

The Relational Data Model

SCHEMA/DATABASE

RELATION/TABLE: Sequences

gi        accession    locus       description
14728383  XM_003323.4              Homo sapiens clock(mouse)...
5453027   AF071947.1   MMTECH13    Mus musculus protein tyrosine...
1161348   U34367.1     HSTECTXT01  Human protein tyrosine kinase...

Each line of the table is a ROW; each named attribute is a COLUMN/FIELD.

DATATYPES

gi       accession    locus        description
int(10)  varchar(15)  varchar(20)  varchar(200)

Relational Query Languages

• The Structured Query Language (SQL)
  - First invented in the early 70's

• SQL incorporates:
  - DDL - Data Definition Language
    • Create/modify/delete schema objects
    • Grant/revoke access to them
  - DML - Data Manipulation Language
    • Used for entering and retrieving data (the "query" part)

What men or gods are these? What maidens loth? What mad pursuit? What struggle to escape? What pipes and timbrels? What wild ecstasy?

-John Keats, Ode on a Grecian Urn

What is the average salary in the Toy department?

-Anonymous SQL user

(From "Database Management Systems" by Raghu Ramakrishnan, p. 181)

DDL: Data Definition Language

CREATE TABLE Sequences (
  gi          int(10),
  accession   varchar(15),
  locus       varchar(20),
  description varchar(200)
);

GRANT SELECT ON Sequences TO fred;

GRANT SELECT ON Sequences TO fred WITH GRANT OPTION;

DROP TABLE Sequences;

DML: Data Manipulation Language

INSERT INTO Sequences VALUES(14728383, 'XM_003323.4', NULL, 'Homo sapiens clock(mouse) homolog (CLOCK), mRNA');

INSERT INTO Sequences VALUES(5453027, 'AF071947.1', 'MMTECH13','Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and partial cds');

UPDATE Sequences SET locus = 'CLOCK' WHERE gi = 14728383;

DELETE FROM Sequences WHERE gi = 14728383;

DML Queries

A basic SQL query:

SELECT gi, description FROM Sequences WHERE locus = 'MMTECH13';

The "WHERE" clause is optional:

SELECT gi, description FROM Sequences;
SELECT * FROM Sequences;

"NULL" values get special treatment:

SELECT * FROM Sequences WHERE locus IS NULL;
(not always the same as "WHERE locus = NULL": in standard SQL a comparison with NULL is never true, so that form matches no rows)
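The NULL behaviour is easy to check with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database stands in for whatever DBMS is actually in use (other systems may treat `= NULL` differently). The table and rows are the ones from the slides, with the second description shortened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequences ("
    " gi INTEGER, accession VARCHAR(15), locus VARCHAR(20), description VARCHAR(200))"
)
conn.executemany(
    "INSERT INTO Sequences VALUES (?, ?, ?, ?)",
    [
        (14728383, "XM_003323.4", None, "Homo sapiens clock (mouse) homolog (CLOCK), mRNA"),
        (5453027, "AF071947.1", "MMTECH13", "Mus musculus protein tyrosine kinase Txk (Txk) gene"),
    ],
)

# IS NULL finds the row whose locus is missing.
is_null_rows = conn.execute("SELECT gi FROM Sequences WHERE locus IS NULL").fetchall()
# "= NULL" compares against an unknown value, so no row qualifies in SQLite.
eq_null_rows = conn.execute("SELECT gi FROM Sequences WHERE locus = NULL").fetchall()
# An ordinary equality test on a non-NULL value works as expected.
by_locus = conn.execute(
    "SELECT gi, description FROM Sequences WHERE locus = 'MMTECH13'"
).fetchall()
```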

Joins of Two Or More Tables

Sequences

gi        accession    locus       description
14728383  XM_003323.4              Homo sapiens clock(mouse)...
5453027   AF071947.1   MMTECH13    Mus musculus protein tyrosine...
1161348   U34367.1     HSTECTXT01  Human protein tyrosine kinase...

Genes

name   species  sequence_gi  description
Clock  human    14728383     Circadian Locomotor Output Cycles Kaput
Txk    mouse    5453027      Protein tyrosine kinase TXK
Tec    human    1161348      Tec protein tyrosine kinase
Txk    human    1161348      Protein tyrosine kinase TXK

SELECT g.name, s.accession, s.locus
FROM Sequences s, Genes g
WHERE g.name = 'Txk'
  AND g.species = 'human'
  AND g.sequence_gi = s.gi;

Joins, Cont.

Query result

g.name  s.accession  s.locus
Txk     U34367.1     HSTECTXT01

• Note that:
  - The result of the query is itself a relational table
  - The columns used to perform the join (gi and sequence_gi) need not appear in the result table
  - Our table aliases appear in the column names

• This type of join is called an "equijoin"

• "Select", "Project", and "Join" are the basic relational operators
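The example join can be run end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite is used only because it needs no server, the description columns are omitted for brevity, and the query uses the column names of the Genes table as shown in the slides (`name`, `species`, `sequence_gi`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequences (gi INTEGER PRIMARY KEY, accession TEXT, locus TEXT);
CREATE TABLE Genes (name TEXT, species TEXT, sequence_gi INTEGER);
INSERT INTO Sequences VALUES (14728383, 'XM_003323.4', NULL);
INSERT INTO Sequences VALUES (5453027, 'AF071947.1', 'MMTECH13');
INSERT INTO Sequences VALUES (1161348, 'U34367.1', 'HSTECTXT01');
INSERT INTO Genes VALUES ('Clock', 'human', 14728383);
INSERT INTO Genes VALUES ('Txk', 'mouse', 5453027);
INSERT INTO Genes VALUES ('Tec', 'human', 1161348);
INSERT INTO Genes VALUES ('Txk', 'human', 1161348);
""")

# Equijoin on Genes.sequence_gi = Sequences.gi, restricted to the human Txk gene.
rows = conn.execute(
    "SELECT g.name, s.accession, s.locus "
    "FROM Sequences s, Genes g "
    "WHERE g.name = 'Txk' AND g.species = 'human' AND g.sequence_gi = s.gi"
).fetchall()
```

The single result row matches the query result table above; the join columns themselves are not part of the output.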

How are Joins Performed?

"NESTED LOOP" ITERATION FOR OUR EXAMPLE JOIN (IGNORING THE RESTRICTIONS ON NAME and SPECIES):

for each row G in table Genes {
  for each row S in table Sequences {
    if (G.sequence_gi = S.gi) then add G+S to the result
  }
}

MORE GENERALLY, G(row1,row2) IS ANY BOOLEAN FUNCTION:

for each row R in table T1 {
  for each row S in table T2 {
    if G(R,S) then add R+S to the result
  }
}
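The nested-loop pseudocode translates almost directly into Python. This is an illustrative sketch only: rows are modelled as dictionaries, `nested_loop_join` is a made-up name, and a real DBMS works on disk pages rather than in-memory lists.

```python
def nested_loop_join(left_rows, right_rows, predicate):
    """Return a merged row for every (left, right) pair that satisfies the predicate."""
    result = []
    for left in left_rows:
        for right in right_rows:
            if predicate(left, right):
                result.append({**left, **right})
    return result


genes = [
    {"name": "Clock", "species": "human", "sequence_gi": 14728383},
    {"name": "Txk", "species": "mouse", "sequence_gi": 5453027},
    {"name": "Tec", "species": "human", "sequence_gi": 1161348},
    {"name": "Txk", "species": "human", "sequence_gi": 1161348},
]
sequences = [
    {"gi": 14728383, "accession": "XM_003323.4"},
    {"gi": 5453027, "accession": "AF071947.1"},
    {"gi": 1161348, "accession": "U34367.1"},
]

# The predicate can be any boolean function of two rows; here it is the equijoin condition.
joined = nested_loop_join(genes, sequences, lambda g, s: g["sequence_gi"] == s["gi"])
```

Every pair of rows is examined, so the work grows with the product of the two table sizes.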

Another Join Strategy for Equijoins

ASSUME WE HAVE AN INDEX ON Sequences.gi:

for each row G in table Genes {
  seqRows = gi_index_lookup(G.sequence_gi)
  for each row S in set seqRows {
    add G+S to the result
  }
}

OR IF WE HAD AN INDEX ON Genes.sequence_gi:

for each row S in table Sequences {
  geneRows = sequence_gi_index_lookup(S.gi)
  for each row G in set geneRows {
    add G+S to the result
  }
}
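The index-based strategy can be sketched in Python with a dictionary standing in for the index. The names `build_index` and `index_join` are hypothetical, and a hash-based dictionary is an assumption for the sake of the example; database indexes are frequently B-trees instead.

```python
from collections import defaultdict


def build_index(rows, key):
    """Build an in-memory index: key value -> list of rows having that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def index_join(outer_rows, outer_key, inner_index):
    """Equijoin: probe the index once per outer row instead of scanning the inner table."""
    result = []
    for outer in outer_rows:
        for inner in inner_index.get(outer[outer_key], []):
            result.append({**outer, **inner})
    return result


genes = [
    {"name": "Clock", "species": "human", "sequence_gi": 14728383},
    {"name": "Txk", "species": "mouse", "sequence_gi": 5453027},
    {"name": "Tec", "species": "human", "sequence_gi": 1161348},
    {"name": "Txk", "species": "human", "sequence_gi": 1161348},
]
sequences = [
    {"gi": 14728383, "accession": "XM_003323.4"},
    {"gi": 5453027, "accession": "AF071947.1"},
    {"gi": 1161348, "accession": "U34367.1"},
]

gi_index = build_index(sequences, "gi")            # index on Sequences.gi
joined = index_join(genes, "sequence_gi", gi_index)
```

The result is the same as for the nested loop, but each row of Genes now costs one index probe rather than a full pass over Sequences.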

Query Optimization

• Choosing a specific query plan for a query
  - Query plans are assembled by choosing from among predefined strategies like those we just saw

• Much database research is devoted to this topic
  - Query plans are usually not optimal
  - Can be a highly computationally intensive process, particularly when many tables are involved

What Must the Optimizer Decide?

• Order in which to perform operations
  - Particularly the order in which to perform joins
  - Accounts for much of the complexity

• Join strategy and/or index selection
  - It is not always faster to use an index

• Other considerations
  - Optimize for response time or overall time?
  - Use additional resources to speed up queries?

Cost-based Optimization versus Rule-based

• Cost-based
  - Enumerates a set of plans and then chooses the "best"
  - Best = lowest predicted "cost"
  - Cost usually defined in terms of disk accesses
  - Relies on table statistics / selectivity estimates

• Rule-based
  - Applies a set of predefined rules to transform a single initial plan until no more rules apply
  - e.g., always use the most specific index possible
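The cost-based approach can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the cost formulas are simplified textbook-style estimates (row comparisons rather than disk accesses), the row counts are invented, and no real optimizer is this simple.

```python
import math


def nested_loop_cost(outer_rows, inner_rows):
    # Assumed cost model: every pair of rows is compared once.
    return outer_rows * inner_rows


def index_join_cost(outer_rows, inner_rows):
    # Assumed cost model: one O(log n) index probe per outer row.
    return outer_rows * max(1.0, math.log2(inner_rows))


def choose_plan(stats):
    """Enumerate candidate plans, estimate a cost for each, and return the cheapest."""
    g, s = stats["Genes"], stats["Sequences"]
    plans = {
        "nested loop (Genes outer)": nested_loop_cost(g, s),
        "index on Sequences.gi (Genes outer)": index_join_cost(g, s),
        "index on Genes.sequence_gi (Sequences outer)": index_join_cost(s, g),
    }
    best = min(plans, key=plans.get)
    return best, plans


# Invented table statistics: a small Genes table and a large Sequences table.
best_plan, plan_costs = choose_plan({"Genes": 1_000, "Sequences": 1_000_000})
```

With these made-up statistics the cheapest estimate is to scan the small table and probe the index on the large one; different statistics can change the choice, which is why the estimates matter.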

RDBMS Advantages Revisited

• Permit arbitrary queries on structured data

• Enforce constraints on the data

• Maintain data integrity in the face of concurrent updates - typically with transactions

• Support different logical views on the data

An Introduction to Constraints

• The following is also a valid FASTA file:

>yada
GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTAAGATGCCTTTGGATAATTCTACAGTCCTCTTAAATGAATCTTTAGAACTTGGCAAGTCTCACTAGATACCTTCAATCATCATTTTGAGCTCAAAGAATTCTGAGACTTATGGTTGGTCATATAGAAGAGTACCTTGAACCTATAGTTTCC
>yada
TTCTCGTCCCGTCTTCGTCTCTCGTCTCGTCGTCGTCGTCGTCTCTCTGCCTCTCCGTCTCGTCGGATCTCGTCGTCTCTCGTCTCGTCTTTCTCTCGTCTGTCAAGGTTTCTCTATGTAACCCTGGCTGTCCTGGAACTCTATGTAGACCAGGCTGGCCTTGAACTCACTCGCTTCTGCCTCCCAGAGTGCTGGGATTAAAGACTTTTGCCACCACACTTAGCTTCAGCCAGTTTTTAAAACAGATCTTTAGATTCTCCCCTAATCCCAAACCAAGTCT
>
TCTAGATGCTTTATACCACTTCCTCTGGAGCAGCTCTGCTTTAGATTTTTTACATATNGGGCCTTTGGGTAAAATTTTATTTAAAGAACAAAATTGCTTACAAAGAAAAGCTTGCAAACTATTTTATGATACCTTGTCTCTACTGCCTTATAAAAAGAAAGAAAAACAGAAAAATGAAAGTGCAGATGGTAACGTGGTGATGGATCCATCTCATGCACTTAAATTAAGCTGAGAGGAAGTTGGAAATCACCAGGTGTGGCCACTGAGGTGAACATTTAGC

  - Duplicate and missing sequence identifiers
  - No restrictions on duplicated sequences
  - Some descriptions do require a "sequence name"

BLASTN 2.0MP-WashU [16-Dec-1999] [linux-x86 23:30:47 16-Dec-1999]

Copyright (C) 1996-1999 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-1999) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query= Query sequence
        (210 letters)

Database: test.fsa
           3 sequences; 770 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                        Smallest
                                                          Sum
                                                 High  Probability
Sequences producing High-scoring Segment Pairs:  Score  P(N)      N

yada                                              1050  5.5e-47   1

>yada
        Length = 210

  Plus Strand HSPs:

 Score = 1050 (163.6 bits), Expect = 5.5e-47, P = 5.5e-47
 Identities = 210/210 (100%), Positives = 210/210 (100%), Strand = Plus / Plus

Query:     1 GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTA 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct:     1 GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTA 60

.

.

.

Constraints, Cont.

• BLAST is perfectly happy with this database

• But ideally we would like to ensure:
  - Every sequence has some kind of identifier
    • so something shows up in our output
  - Each identifier is unique, at least within a database
    • so we know what we've hit

• In a relational database we would use:
  - "NOT NULL" constraint
  - "UNIQUE" constraint

Constraints in SQL

CREATE TABLE Sequences (
  gi          int(10) NOT NULL,
  accession   varchar(15) NOT NULL,
  locus       varchar(20),
  description varchar(200),
  PRIMARY KEY(gi),
  UNIQUE(accession),
  INDEX(accession)
);
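To see the constraints being enforced, here is a small sketch using Python's built-in sqlite3 module. It is an approximation of the slide's table, not a copy: SQLite does not accept the inline `INDEX(...)` clause, so that line is dropped (the UNIQUE constraint creates an index anyway), and the helper `try_insert` and the shortened descriptions are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequences ("
    " gi INTEGER NOT NULL,"
    " accession VARCHAR(15) NOT NULL,"
    " locus VARCHAR(20),"
    " description VARCHAR(200),"
    " PRIMARY KEY (gi),"
    " UNIQUE (accession))"
)
conn.execute(
    "INSERT INTO Sequences VALUES (1161348, 'U34367.1', 'HSTECTXT01', 'Human TEC/TXK')"
)


def try_insert(values):
    """Return None on success, or the database's error message on a constraint violation."""
    try:
        conn.execute("INSERT INTO Sequences VALUES (?, ?, ?, ?)", values)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


duplicate_error = try_insert((999, "U34367.1", None, "same accession again"))
missing_error = try_insert((1000, None, None, "no accession at all"))
ok = try_insert((5453027, "AF071947.1", "MMTECH13", "Mus musculus Txk"))
```

Unlike the FASTA file above, the database refuses both the duplicate identifier and the missing one.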

Transactions: "ACID" Properties

• Atomicity
  - All operations in a transaction fail or all succeed (even in the event of a system crash)
  - There are no "partially processed" transactions

• Consistency
  - If the database is in a consistent state before the transaction starts, then it must be in a consistent state when it ends (e.g., with respect to its constraints.)

Transactions, Cont.

• Isolation
  - Actions carried out in one transaction should not be affected by those carried out in another (unless the other completes before it starts.)

• Durability
  - Once the system reports a transaction has succeeded its effects must persist in the database
  - Even in the event of one or more system crashes
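Atomicity can be demonstrated with Python's built-in sqlite3 module. This is a sketch under assumptions: an in-memory SQLite database stands in for a production DBMS, the two-column table is a cut-down version of Sequences, and the failure is provoked deliberately with a duplicate accession.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequences (gi INTEGER PRIMARY KEY, accession TEXT NOT NULL UNIQUE)"
)
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("INSERT INTO Sequences VALUES (14728383, 'XM_003323.4')")
        conn.execute("INSERT INTO Sequences VALUES (5453027, 'AF071947.1')")
        # Violates the UNIQUE constraint, so the whole transaction is rolled back.
        conn.execute("INSERT INTO Sequences VALUES (1161348, 'AF071947.1')")
except sqlite3.IntegrityError:
    pass

rows_after_failure = conn.execute("SELECT COUNT(*) FROM Sequences").fetchone()[0]

with conn:  # the same first two inserts, this time without the offending third one
    conn.execute("INSERT INTO Sequences VALUES (14728383, 'XM_003323.4')")
    conn.execute("INSERT INTO Sequences VALUES (5453027, 'AF071947.1')")

rows_after_success = conn.execute("SELECT COUNT(*) FROM Sequences").fetchone()[0]
```

After the failed transaction none of its three inserts is visible, including the two that succeeded individually; after the second transaction both rows are present.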

Implementing Transaction Isolation

• Typically entails the use of "locks"

• Locks come in different flavors, for example:
  - read-only (or "shared")
  - exclusive

• 2-phase locking is one relatively simple scheme to guarantee serializability

• Locking can unfortunately result in "deadlock"

Ensuring Atomicity and Durability

• Write-Ahead Logging
  - Two distinct storage areas in most databases
  - Write to the log before updating the actual tables
  - And before saying that a commit has succeeded
  - Then transactions can be rolled back or forward

• "Well, I hope you've got the WAL on a RAID-1 volume. Keep up the good work."
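The write-ahead idea can be shown with a deliberately tiny Python model. It is a toy under loud assumptions: `ToyDatabase` is invented for this example, the "log" and "table" are in-memory Python objects rather than separate storage areas on disk, and the crash is simulated with a flag.

```python
class ToyDatabase:
    """Toy key-value store with a write-ahead log (illustration only)."""

    def __init__(self):
        self.log = []    # stands in for the durable log
        self.table = {}  # stands in for the actual tables

    def commit(self, txn_id, updates, crash_before_apply=False):
        # 1. Log every change, then the commit record, before touching the table.
        for key, value in updates.items():
            self.log.append(("update", txn_id, key, value))
        self.log.append(("commit", txn_id))
        if crash_before_apply:
            return  # simulated crash: the table was never updated
        # 2. Only now apply the changes to the table itself.
        self.table.update(updates)

    def recover(self):
        """Rebuild the table by replaying only the transactions that committed."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        self.table = {}
        for rec in self.log:
            if rec[0] == "update" and rec[1] in committed:
                _, _, key, value = rec
                self.table[key] = value


db = ToyDatabase()
db.commit(1, {"U34367.1": "human"})
db.commit(2, {"AF071947.1": "mouse"}, crash_before_apply=True)
db.log.append(("update", 3, "XM_003323.4", "human"))  # transaction 3 never committed
db.recover()
```

Recovery rolls transaction 2 forward (it was logged as committed but never applied) and ignores transaction 3 (no commit record), which is the roll-forward and roll-back behaviour the bullet list describes.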

A Word On Views

• A view is a table constructed by a query

• May or may not be "materialized"

• May or may not be updateable

• Can be used to implement access control

CREATE VIEW HumanGenes
AS SELECT * FROM Genes WHERE species = 'human';

GRANT SELECT ON HumanGenes TO bob;
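A view can be tried out with Python's built-in sqlite3 module. This is a sketch, with one caveat stated up front: SQLite has no user accounts and no GRANT statement, so only the view definition is shown here; the access-control half of the example needs a multi-user DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Genes (name TEXT, species TEXT, sequence_gi INTEGER, description TEXT);
INSERT INTO Genes VALUES ('Clock', 'human', 14728383, 'Circadian Locomotor Output Cycles Kaput');
INSERT INTO Genes VALUES ('Txk', 'mouse', 5453027, 'Protein tyrosine kinase TXK');
INSERT INTO Genes VALUES ('Tec', 'human', 1161348, 'Tec protein tyrosine kinase');
INSERT INTO Genes VALUES ('Txk', 'human', 1161348, 'Protein tyrosine kinase TXK');
CREATE VIEW HumanGenes AS SELECT * FROM Genes WHERE species = 'human';
""")

# The view is queried exactly like a table; the mouse row is simply not visible through it.
human_names = [row[0] for row in conn.execute("SELECT name FROM HumanGenes ORDER BY name")]
species_seen = {row[0] for row in conn.execute("SELECT species FROM HumanGenes")}
```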

Commercial Database Systems

• Sybase and Oracle
  - Arch-enemies, with Oracle the current leader
  - Both now include object-relational features and are marketed as so-called "universal" databases

• IBM's DB2
  - IBM has a long history of RDBMS development, beginning with "System R" in the 1970s

• O2 ("O deux"), Poet, Versant, etc.

Open Source Database Systems

• MySQL
  http://www.mysql.com
  - Very popular; reported to be fast for queries
  - Lacks a few common SQL features, some by design:

• Subqueries

• Transactions (until recently)

• Foreign key constraints

• Stored procedures & triggers

• Views


Open Source Database Systems

• PostgreSQL (post-gres-Q-L)
  http://www.us.postgresql.org/index.html
  - Object-relational DBMS
  - Based on the POSTGRES research prototype from Stonebraker's lab at UC Berkeley
  - Currently enjoying a resurgence in popularity
  - Similar ideas were commercialized by Stonebraker in Illustra, which was acquired by Informix circa 1995

References

• FASTA format description http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

• EMBL format http://www.ebi.ac.uk/embl/Documentation/User_manual/format.html

• ASN.1 (Abstract Syntax Notation One) http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html

• XML (Extensible Markup Language) http://www.w3.org/XML/


References, Cont.

• "Open Source" software http://www.opensource.org

• Database management system textbooks
  - Database Management Systems. Raghu Ramakrishnan. McGraw-Hill, 1998. ISBN 0-07-050775-9.
  - Principles of Database and Knowledge-Base Systems, Volume I: Classical Database Systems. Jeffrey D. Ullman. Computer Science Press, 1988. ISBN 0-7167-8158-1.


Part II: Domain-Specific Considerations

OR:

"What's so special about bioinformatics anyway?"

Bioinformatics / Genomics

• Tremendous (exponential) influx of data

• This by itself is not unique

• Novel characteristics:
  - Rapidly evolving understanding/schemata
  - Complex links in and between databases
  - Heterogeneity of databases & resources

Advent of Genomics

• Transformation of biology:
  - Traditional gene-by-gene approach, applying a series of hypothesis-driven experiments.
  - Versus highly parallel approach, applying the same experiment to many targets.

• Analogous change in information systems.
  - Point-and-click hyperlink chasing.
  - "Bulk" queries or analysis requiring integration.

• Has led to more widespread RDBMS adoption

Flashback to 1998

[Figure: chart of data volume against year (1990-2005). Data type labels: Mapping, Genomic Sequence, EST Sequence, Gene Expression, Proteomics, Polymorphisms, 3D Protein Structures. Organism labels: Yeast/Bacteria, C. elegans, Arabidopsis, Human. Era labels: "Genome I: Sequence genomics" and "Genome II: Functional genomics".]

Where to Put the Sequence?

• In files or in the database?
  - Depends on the filesystem
  - All the usual advantages to using the database

• If in the database, what datatype to use?
  - TEXT or CLOB
  - Hybrid approach (varchar + CLOB)

• Should it be compressed?
  - An alphabet of about 32-64 symbols (5-6 bits per residue) suffices, even for amino acids; unambiguous DNA needs only 2 bits per base
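As an illustration of sequence compression, the sketch below packs unambiguous DNA at 2 bits per base in Python. It is a minimal example, not a recommendation: `pack` and `unpack` are made-up names, and the encoding deliberately ignores ambiguity codes such as the N that appears in one of the example sequences, which would need a larger alphabet or an escape mechanism.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack(packed, length):
    """Recover the original string; the length is needed to drop padding bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


dna = "GAATTTTTACTTGTTCCTGC"  # first 20 bases of the CLOCK example
packed = pack(dna)
restored = unpack(packed, len(dna))
```

The original length has to be stored alongside the packed bytes, since the final byte may contain padding.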

Support For Schema Evolution

• Schema evolution
  - We are still learning all the rules of biology
    • e.g., discovery of introns
  - Experimental techniques are constantly improving
    • e.g., microarray-based assays for RNA abundance

• In most relational databases:
  - Fairly easy to add columns and tables
  - Not so easy to restructure existing ones

Schema Evolution, Cont.

• Problem can be mitigated somewhat by:
  - Higher-level abstractions & data models
  - Judicious use of views

• Actually an argument for less structure:
  - FASTA
  - GFF (Gene Finding Format)
  - DAS (Distributed Annotation System)

• Tension between extensibility and queryability

Database Heterogeneity

• Many databases, both large and small scale

• Syntactic heterogeneity
  - Different storage formats and databases
  - Different access mechanisms / query languages

• Fundamental semantic differences
  - What is a gene?
    • A locus associated with a particular disease?
    • A specific region of an organism's DNA?
    • What about chromosomal rearrangements?

Database Integration Systems

• Essentially allow the definition of views across multiple heterogeneous databases

• Provide a virtual global schema against which queries may be posed
  - Data model and query language must be sufficiently sophisticated to handle all the source databases

• Resolve syntactic and semantic heterogeneity
  - The latter is typically the difficult problem

• Usually mediator or wrapper-based

Architecture of an Example System

Combining Specialized Tools with General-Purpose Databases

• e.g. Run a BLAST search, then run an SQL query on each sequence hit with p-value < 10E-30

• Two most common solutions:
  - Add specialized searches as operations to existing database systems
    • Many allow users to define their own ADTs
    • Question of how they interact with query optimizer
  - Add these search tools as additional "data sources" to a database integration system
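The simplest way to combine the two kinds of tool is to do it by hand in a script, which the Python sketch below illustrates. It rests on assumptions throughout: the `blast_hits` list is invented (a real pipeline would parse actual BLAST output), the cutoff is taken to mean 1e-30, the descriptions are shortened, and an in-memory SQLite table stands in for the real database.

```python
import sqlite3

# Hypothetical parsed BLAST hits as (accession, p_value) pairs; values are invented.
blast_hits = [("U34367.1", 5.5e-47), ("AF071947.1", 2.0e-12), ("XM_003323.4", 1.0e-35)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequences (accession TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO Sequences VALUES (?, ?)",
    [
        ("U34367.1", "Human protein tyrosine kinase TEC (tec) gene"),
        ("AF071947.1", "Mus musculus protein tyrosine kinase Txk (Txk) gene"),
        ("XM_003323.4", "Homo sapiens clock (mouse) homolog (CLOCK), mRNA"),
    ],
)

P_VALUE_CUTOFF = 1e-30
significant = [accession for accession, p_value in blast_hits if p_value < P_VALUE_CUTOFF]

# One SQL query per significant hit, as in the scenario on the slide.
descriptions = {}
for accession in significant:
    row = conn.execute(
        "SELECT description FROM Sequences WHERE accession = ?", (accession,)
    ).fetchone()
    if row is not None:
        descriptions[accession] = row[0]
```

Neither the search tool nor the database knows about the other here, so nothing can be optimized across the two steps; that is the gap the two solutions above try to close.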

References - Database Integration

• SRS (Sequence Retrieval System)

http://srs.ebi.ac.uk

System of indexed flat files that supports some RDBMS-like join and selection operations

Can only join on columns preselected by the administrator

• IBM's DiscoveryLink

http://www.ibm.com/discoverylink

Commercial offshoot of the GARLIC research project

DB2-based, relies heavily on cost-based optimization

• Multidatabase OPM

http://gizmo.lbl.gov/DM_TOOLS/DMTools.html

Commercial/now part of GeneLogic's internal development only


References, Cont.

• TAMBIS @ University of Manchester

http://img.cs.man.ac.uk/tambis/

Description logic based; permits reasoning about metadata/schema

• Kleisli / UPenn. & KRDL, Singapore

http://www.geneticXchange.com/

Now a commercial product of "geneticXchange, Inc."

• K2 / UPenn. Center for Bioinformatics

http://www.cbil.upenn.edu/K2

http://www.cis.upenn.edu/~sharker/K2_site/

Follow-on project to Kleisli; implemented in Java, not SML

References, Cont.

• Many others not marketed/targeted specifically for bioinformatics

e.g., TSIMMIS, Tukwila, Ariadne, InfoMaster

see http://www.cbil.upenn.edu/java/k2/servlet?page=links for some links

• Additional data formats and standards mentioned

DAS (Distributed Annotation System)

• http://das.wustl.edu

GFF (Gene-Finding Format)

• http://www.sanger.ac.uk/Software/formats/GFF/

Part III: Bioinformatics Database Resources on the World Wide Web

OR:

"We can build a database and put it on the network. We like databases."

Classifying Database Resources

• Organism-specific

• Datatype-specific
  - Mapping (e.g. fingerprint, STS-content)
  - Sequence (e.g. DNA, mRNA, protein)
  - Phenotype (e.g. mutations, expression)

• Archival vs. curated, primary vs. secondary

• Data submission method (if any)

Classifying Databases, Cont.

• Who owns (or sells) the content?
  - Increasing commercialization of databases
  - At least two different varieties:
    • Core data itself is proprietary (e.g. Incyte, Celera)
    • Value-added model (e.g. SWISS-PROT, DoubleTwist)

• What access is provided?
  - Browsing only
  - Ad-hoc query facility (e.g. boolean queries)
  - Unrestricted SQL queries against underlying DBMS

Databases - Sequence

• DNA (i.e., genomic sequence)

International Collaboration of sequence DBs:

• GenBank (U.S.A. / NCBI) http://www.ncbi.nlm.nih.gov/

• EMBL (Europe / EBI): http://www.ebi.ac.uk/embl/

• DDBJ (Japan / MEXT) http://www.ddbj.nig.ac.jp/

Have evolved to accommodate high-throughput and shotgun sequencing:

• "Trace archives" for mouse and other organisms

• http://www.ncbi.nlm.nih.gov/Traces/

• ESTs - Expressed Sequence Tags

dbEST

• http://www.ncbi.nlm.nih.gov/dbEST

• GenBank subset with additional EST-specific data

• Implemented in a Sybase relational database

• SNPs - Single Nucleotide Polymorphisms

dbSNP

• http://www.ncbi.nlm.nih.gov/SNP/

• Very similar to dbEST in philosophy and implementation

• Many commercial databases

Celera, Incyte, etc.

• Proteins (because most genes code for them)

SWISS-PROT and TrEMBL

• http://www.expasy.ch/sprot/sprot-top.html

• Manually and automatically annotated protein sequences

• Became a commercial database circa 1998

• Data modeling decision: one entry may represent the same protein from different organisms if the sequences are sufficiently similar. What is a protein?

• Protein motifs (because proteins are modular)

PRODOM

• http://prodes.toulouse.inra.fr/prodom/doc/prodom.html

• Consensus motifs from multiple sequence alignments

• Some of the newer families were built automatically from entries in Pfam-A; illustrates interdependency of databases, potential for error propagation.

PROSITE

• http://www.expasy.ch/prosite

• Manually curated, much like SWISS-PROT with regular expression style motifs, e.g. W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-P

• Like SWISS-PROT, also a commercial venture
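Because PROSITE patterns are so close to regular expressions, a short Python sketch can translate one into the other. This is an illustration only: `prosite_to_regex` is a made-up helper covering just the common syntax elements (x, brackets, braces for excluded residues, repeat counts, and the `<` and `>` anchors), and the test sequence is artificial rather than a real protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {..} = excluded residues
        element = element.replace("(", "{").replace(")", "}")    # (n) or (n,m) = repeat count
        element = element.replace("<", "^").replace(">", "$")    # sequence start / end
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-P")

# Artificial sequence built to satisfy the pattern.
matching = "W" + "A" * 10 + "V" + "F" + "A" * 6 + "G" + "P"
hit = re.search(regex, matching)
miss = re.search(regex, "AAAA")
```

The translation is purely syntactic, which is one reason these motifs are easy to search with general-purpose tools.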

PRINTS

• http://bioinf.man.ac.uk/dbbrowser/PRINTS

• Database of protein "fingerprints"

• Now has a relational version, "PRINTS-S"

BLOCKS

• http://blocks.fhcrc.org/blocks/

• Automatically determined ungapped conserved segments

Pfam (WashU/Sanger)

• http://pfam.wustl.edu/

• Protein families represented by profile HMMs generated from multiple sequence alignments.

CDD (Conserved Domain Database)

• http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

• Based on SMART and Pfam, with some additions

INTERPRO

• http://www.ebi.ac.uk/interpro/

• Includes Pfam, PRINTS, PROSITE, ProDom, SMART

• Mapping from Interpro domains to GO

3-Dimensional Protein Structures

• PDB (Protein Data Bank)
  http://www.rcsb.org/pdb
  - Who can argue with the database that brings you the "molecule of the month"?

• MMDB (Molecular Modeling DataBase)
  http://www.ncbi.nlm.nih.gov/Structure/
  - Experimentally-determined 3D structures obtained from PDB; another case of derived data (data provenance, anyone?)

Databases - Expression

• GXD (mouse Gene Expression Database) http://www.informatics.jax.org/mgihome/GXD/gxdgen.shtml

In-situ expression studies, Northern & Western blots, and RT-PCR experiments, in addition to the (now) typical high-density microarray data.

Linked to a database of anatomy and a 3D atlas; Sybase DB.

• Newer databases focus on array-based data

MGED standards effort http://www.mged.org

Genome/Integrated

• Golden Path (UCSC)
  http://genome.ucsc.edu
  - Draft public human genome assembly produced by Jim Kent in the Haussler lab.
  - Includes many additional "tracks" or datasets provided by other sites (e.g., gene predictions, markers, etc.)

• Ensembl (EBI & Sanger Centre)
  http://www.ensembl.org
  - Uses the Golden Path assembly as the basis for its automated analyses.

Integrated Databases/Systems

• Entrez (NCBI)
  http://www.ncbi.nlm.nih.gov/Entrez/
  - Integrated boolean query interface
  - Includes MMDB, PubMed, GenBank, GenPept, etc.

• SRS - Sequence Retrieval System
  http://srs.ebi.ac.uk
  - Mentioned earlier under database integration

"Model Organism" Databases

• Mouse (Mus musculus)

• Fly (Drosophila melanogaster)

• Arabidopsis (Arabidopsis thaliana)

• Worm (Caenorhabditis elegans)

• E. coli (Escherichia coli)

• Yeast (Saccharomyces cerevisiae)

• Zebrafish (Danio rerio)

Genome/Mapping & Phenotype

• GDB (Genome DataBase)
  http://gdbwww.gdb.org
  - Human mapping database; focus on genes and loci
  - Implemented in OPM over Sybase (now Oracle 8i?)

• MGD (Mouse Genome Database)
  http://www.informatics.jax.org/
  - Mouse equivalent of GDB; implemented in Sybase
  - Integrated with sister databases GXD and MGS

• TAIR (The Arabidopsis Information Resource)
  http://www.arabidopsis.org/
  - Based on Sybase database from NCGR

• AceDB (A C. elegans Database)
  http://www.acedb.org/
  - Example of a DBMS developed specifically for a particular biological application (and organism!)
  - Very popular due to graphical interface
  - But lacks many basic RDBMS features

• WormBase
  http://www.wormbase.org/
  - Replacement for AceDB; initially a retrofitting of new user interfaces to the existing AceDB engine

• YPD (Yeast Proteome Database)
  http://www.proteome.com/DB-demo/intro-to-YPD.html
  - Another merger victim; Proteome, Inc. & Incyte

• SGD (Saccharomyces Genome Database)
  http://genome-www.stanford.edu/Saccharomyces/

Databases - Others

• MEDLINE/PubMed - literature
  http://www4.ncbi.nlm.nih.gov/PubMed
  - Integrated into NCBI's "Entrez" system.

• OMIM - Online Mendelian Inheritance in Man
  http://www.ncbi.nlm.nih.gov
  - A database that started out as a book (MIM).
  - Provides a view of the world based on human diseases with a genetic component.

• TRANSFAC
  http://transfac.gbf.de
  See also http://www.cbil.upenn.edu/tess/
  - Transcription factors and their genes; transcription factor binding site models.

• GeneCards
  http://bioinformatics.weizmann.ac.il/cards/
  - Built through the magic of Perl.
  - Attempts to automatically summarize all available information and database links for known genes.

The Database Databases

• DBCAT (Infobiogen)
  http://www.infobiogen.fr/services/dbcat/
  - Current total: 511 databases (as of 10/18/2001)
  - Self-referential; contains itself as an entry

• The Biocatalog (EBI)
  http://www.ebi.ac.uk/biocat/
  - Does not appear to list itself...
  - Last updated July 25, 2000
