Page 1: Databases for Bioinformatics and Genomics Jonathan Crabtree crabtree@pcbi.upenn.edu Center for Bioinformatics University of Pennsylvania

Databases for Bioinformatics and Genomics

Jonathan Crabtree
crabtree@pcbi.upenn.edu

Center for Bioinformatics
University of Pennsylvania


OUTLINE

• Introduction to relational databases

• Some problems specific to bioinformatics

• Survey of bioinformatics databases and resources on the web


Part I: An Introduction to Relational Databases

OR:

"Just what is the difference between Larry Ellison and God?"


What is a Database?

• Collection of data, often organized for fast retrieval

• Example: BLAST database (FASTA file)

>gi|14728383|ref|XM_003323.4| Homo sapiens clock (mouse) homolog (CLOCK), mRNA
GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTAAGATGCCTTTGGATAATTCTACAGTCCTCTTAAATGAATCTTTAGAACTTGGCAAGTCTCACTAGATACCTTCAATCATCATTTTGAGCTCAAAGAATTCTGAGACTTATGGTTGGTCATATAGAAGAGTACCTTGAACCTATAGTTTCC
..
>gi|5453027|gb|AF071947.1|MMTECH13 Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and partial cds
TTCTCGTCCCGTCTTCGTCTCTCGTCTCGTCGTCGTCGTCGTCTCTCTGCCTCTCCGTCTCGTCGGATCTCGTCGTCTCTCGTCTCGTCTTTCTCTCGTCTGTCAAGGTTTCTCTATGTAACCCTGGCTGTCCTGGAACTCTATGTAGACCAGGCTGGCCTTGAACTCACTCGCTTCTGCCTCCCAGAGTGCTGGGATTAAAGACTTTTGCCACCACACTTAGCTTCAGCCAGTTTTTAAAACAGATCTTTAGATTCTCCCCTAATCCCAAACCAAGTCT
..
>gi|1161348|gb|U34367.1|HSTECTXT01 Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, exon 1
TCTAGATGCTTTATACCACTTCCTCTGGAGCAGCTCTGCTTTAGATTTTTTACATATNGGGCCTTTGGGTAAAATTTTATTTAAAGAACAAAATTGCTTACAAAGAAAAGCTTGCAAACTATTTTATGATACCTTGTCTCTACTGCCTTATAAAAAGAAAGAAAAACAGAAAAATGAAAGTGCAGATGGTAACGTGGTGATGGATCCATCTCATGCACTTAAATTAAGCTGAGAGGAAGTTGGAAATCACCAGGTGTGGCCACTGAGGTGAACATTTAGC
..


BLAST Database Limitations

• Little structured data (apart from the sequence itself)

• Not guaranteed to have any unique identifier

• Defline is often structured by convention only (as in our example)

• Designed to answer only one class of queries: "What do you have that's similar to this sequence?"


EMBL format

ID   HSECTXT01  standard; DNA; HUM; 5579 BP.
XX
AC   U34367;
XX
SV   U34367.1
XX
DT   24-JAN-1996 (Rel. 46, Created)
DT   02-JUL-1999 (Rel. 60, Last updated, Version 7)
XX
DE   Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine
DE   kinase TXK (txk) gene, exon 1.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RX   MEDLINE; 96197775.
RA   Ohta Y., Haire R.N., Amemiya C.T., Litman R.T., Trager T., Riess O.,
RA   Litman G.W.;
RT   "Human Txk: genomic organization, structure and contiguous physical linkage
RT   with the Tec gene";
RL   Oncogene 12(4):937-942(1996).
XX
RN   [2]
RP   1-5579
RA   Litman G.W.;
RT   ;
RL   Submitted (18-AUG-1995) to the EMBL/GenBank/DDBJ databases.
RL   Gary W. Litman, All Childrens Hospital, 801 Sixth Street South, St
RL   Petersburg, FL 33701, USA

Compare with FASTA/BLAST format:

>gi|1161348|gb|U34367.1|HSTECTXT01 Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, exon 1


XX
DR   SPTREMBL; Q14219; Q14219.
DR   SWISS-PROT; P42681; TXK_HUMAN.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5579
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT                   /cell_type="leukocyte"
FT                   /tissue_type="blood"
FT                   /dev_stage="adult"
FT   intron          <1..407
FT                   /gene="tec"
FT   mRNA            join(<408..526,608..765,1931..3624)
FT                   /gene="tec"
FT                   /product="protein tyrosine kinase"
FT   CDS             join(<408..526,608..765,1931..2014)
FT                   /codon_start=2
FT                   /db_xref="SPTREMBL:Q14219"
FT                   /gene="tec"
FT                   /product="protein tyrosine kinase"
FT                   /protein_id="AAB60411.1"
FT                   /translation="YVLDDQYTSSSGAKFPVKWCPPEVFNYSRFSSKSDVWSFGVLMWE
FT                   VFTEGRMPFEKYTNYEVVTMVTRGHRLYQPKLASNYVYEVMLRCWQEKPEGRPSFEDLL
FT                   RTIDELVECEETFGR"
FT   exon            5166..5267
FT                   /number=1
FT                   /gene="txk"
FT                   /product="tyrosine kinase"
XX
SQ   Sequence 5579 BP; 1512 A; 1112 C; 1275 G; 1679 T; 1 other;
     tctagatgct ttataccact tcctctggag cagctctgct ttagattttt tacatatngg        60
     gcctttgggt aaaattttat ttaaagaaca aaattgctta caaagaaaag cttgcaaact       120
     attttatgat accttgtctc tactgcctta taaaaagaaa gaaaaacaga aaaatgaaag       180
     tgcagatggt aacgtggtga tggatccatc tcatgcactt aaattaagct gagaggaagt       240
     tggaaatcac caggtgtggc cactgaggtg aacatttagc ttcaaccgga tgcttgttgg       300
     . . .


EMBL Format

• Considered a "flat-file" format (i.e. plain text)

• But it allows for additional explicit structure
    - Each entry has multiple named "fields"
    - The meanings of these fields are predefined

• Entries have a unique identifier or "accession"
    - "Primary key" in relational database lingo
    - Its presence leads to an obvious query: "Tell me everything you know about the entry with ID X"
    - Can be answered quickly by indexing the file


Indexed Flat File

EMBL-FORMAT FLAT FILE

   1  ID   HSECTXT01 standard; DNA; HUM; 5579 BP.
   2  XX
   3  AC   U34367;
   4  XX
   5  SV   U34367.1
   6  XX
   7  DT   24-JAN-1996 (Rel. 46, Created)
   8  DT   02-JUL-1999 (Rel. 60, Last updated, Version 7)
   9  XX
  10  DE   Human protein tyrosine kinase TEC (tec) gene, partial cds, and
  11  DE   tyrosine kinase TXK (txk) gene, exon 1.
  12  XX
  13  KW   .
  14  XX
   .
   .
4073  ID   AF071947 standard; DNA; ROD; 1381 BP.
4074  XX
4075  AC   AF071947;
4076  XX
4077  SV   AF071947.1
4078  XX
4079  DT   13-JUL-1999 (Rel. 60, Created)
4080  DT   13-JUL-1999 (Rel. 60, Last updated, Version 1)
4081  XX
4082  DE   Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and
4083  DE   partial cds.
   .
   .
   N lines

INDEX FILE

U34367.1      1
AF071947.1    4073
   .
   .
   M entries

Find entry with a particular ID:

Naïve approach (scan the flat file): O(N)

Using an index (scan the index, then jump to the line): O(M)

Can be improved: O(log M) with a sorted index, or constant time, O(C), with a hash table
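The lookup strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular EMBL indexing tool works: the function names (`naive_find`, `build_index`, `indexed_find`) and the tiny in-memory "file" are hypothetical, and the index is keyed on the accession from the AC line rather than on the versioned identifier shown on the slide.

```python
# Toy EMBL-style flat file held in memory as a list of lines (hypothetical data).
FLAT_FILE = [
    "ID   HSECTXT01  standard; DNA; HUM; 5579 BP.",
    "XX",
    "AC   U34367;",
    "XX",
    "DE   Human protein tyrosine kinase TEC (tec) gene, partial cds",
    "//",
    "ID   AF071947   standard; DNA; ROD; 1381 BP.",
    "XX",
    "AC   AF071947;",
    "XX",
    "DE   Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1",
    "//",
]


def naive_find(lines, accession):
    """O(N): scan every line until the matching AC line is found."""
    entry_start = None
    for number, line in enumerate(lines, start=1):
        if line.startswith("ID "):
            entry_start = number
        elif line.startswith("AC ") and line[2:].strip().rstrip(";") == accession:
            return entry_start
    return None


def build_index(lines):
    """One pass over the file: map accession -> line number of its ID line."""
    index = {}
    entry_start = None
    for number, line in enumerate(lines, start=1):
        if line.startswith("ID "):
            entry_start = number
        elif line.startswith("AC "):
            index[line[2:].strip().rstrip(";")] = entry_start
    return index


def indexed_find(index, accession):
    """Constant time on average: a dict is a hash table, so no scan is needed."""
    return index.get(accession)


INDEX = build_index(FLAT_FILE)
print(naive_find(FLAT_FILE, "AF071947"))   # line number of the entry's ID line
print(indexed_find(INDEX, "AF071947"))     # same answer, without scanning
```

A sorted list of (accession, line number) pairs searched with the standard `bisect` module would give the O(log M) variant instead of the hash-table one.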


ASN.1 (GenBank)

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Human protein tyrosine kinase TEC (tec) gene, partial cds, and tyrosine kinase TXK (txk) gene, and translated products" ,
    source {
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag id 9606 } } ,
        orgname {
          name binomial {
            genus "Homo" ,
            species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,
          gcode 1 ,
          mgcode 2 ,
          div "PRI" } } ,
..
        from journal {
          title {
            iso-jta "Oncogene" } ,
          imp {
            date std {
              year 1996 ,
              month 2 ,
              day 15 } ,
            volume "12" ,
            issue "4" ,
            pages "937-942" } } } } ,
..

Compare with EMBL format:

RL   Oncogene 12(4):937-942(1996).


ASN.1

• Unlike EMBL format, it allows one to give a formal description of what data is valid
    - So its validity can be checked automatically
    - Same idea as DTDs for XML, which is quite similar

• But, as with everything we've seen so far, ASN.1 data is typically stored in plain text files

• Used extensively for data management at NCBI (i.e. GenBank, PubMed, Entrez)


Data Model vs. Storage Format

Increasing sophistication of data model:
FASTA format (free text) -> EMBL format (tag, value pairs) -> XML / ASN.1 / object-oriented (arbitrary nesting of data)

Increasing sophistication of physical storage mechanism:
flat file -> indexed flat file -> RDBMS -> OODBMS

• Can store FASTA files in a relational database

• Or highly-structured datasets in a flat file


What's Wrong with Flat Files?

• Only one person can edit them at any given time

• Edits must be checked for validity

• Can become corrupted if the computer crashes

• Don't support queries without additional data structures and/or search tools (e.g., BLAST)

• Implementing access controls will likely require splitting the file or writing additional software

• Many systems have fairly low file size limits

• Database guys will laugh at you behind your back


Relational (and other) databases:

• Solve many of these problems in a generic way

• Provide a reasonably sophisticated data model

• Enable fast queries over data stored in that model

• Ensure the integrity & consistency of the data


Components of a Database

• Data model - what the data looks like
    - How is the database structured?
    - Note that a database can have structure that is not explicitly identified by the data model.

• Access method(s) - how it can be accessed
    - Usually based on the data model
    - Could be an external tool like BLAST
    - Or a general-purpose "query language"


The Relational Data Model

SCHEMA/DATABASE

RELATION/TABLE: Sequences

gi        accession    locus       description
14728383  XM_003323.4              Homo sapiens clock(mouse)...
5453027   AF071947.1   MMTECH13    Mus musculus protein tyrosine...
1161348   U34367.1     HSTECTXT01  Human protein tyrosine kinase...

ROW: one entry in the table (e.g. the line for gi 5453027)

COLUMN/FIELD: one named attribute (e.g. accession)

DATATYPES: one per column

gi        accession    locus        description
int(10)   varchar(15)  varchar(20)  varchar(200)


Relational Query Languages

• The Structured Query Language (SQL)
    - First developed in the early 1970s

• SQL incorporates:
    - DDL - Data Definition Language
        • Create/modify/delete schema objects
        • Grant/revoke access to them
    - DML - Data Manipulation Language
        • Used for entering and retrieving data (the "query" part)


What men or gods are these? What maidens loth? What mad pursuit? What struggle to escape? What pipes and timbrels? What wild ecstasy?

-John Keats, Ode on a Grecian Urn

What is the average salary in the Toy department?

-Anonymous SQL user

(From "Database Management Systems" by Raghu Ramakrishnan, p. 181)


DDL: Data Definition Language

CREATE TABLE Sequences (
    gi          int(10),
    accession   varchar(15),
    locus       varchar(20),
    description varchar(200)
);

GRANT SELECT ON Sequences TO fred;

GRANT SELECT ON Sequences TO fred WITH GRANT OPTION;

DROP TABLE Sequences;


DML: Data Manipulation Language

INSERT INTO Sequences
VALUES (14728383, 'XM_003323.4', NULL,
        'Homo sapiens clock(mouse) homolog (CLOCK), mRNA');

INSERT INTO Sequences
VALUES (5453027, 'AF071947.1', 'MMTECH13',
        'Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and partial cds');

UPDATE Sequences SET locus = 'CLOCK' WHERE gi = 14728383;

DELETE FROM Sequences WHERE gi = 14728383;
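One low-effort way to try the DDL and DML statements above is Python's built-in `sqlite3` module. SQLite is an assumption made here for convenience (the slides do not tie these statements to a particular DBMS): it accepts the `int(10)` and `varchar(n)` declarations but does not enforce the lengths, and it has no GRANT statement, so the access-control statements are left out of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define the table from the slides.
cur.execute("""
    CREATE TABLE Sequences (
        gi          int(10),
        accession   varchar(15),
        locus       varchar(20),
        description varchar(200)
    )
""")

# DML: insert two rows, using placeholders rather than string formatting.
rows = [
    (14728383, "XM_003323.4", None,
     "Homo sapiens clock(mouse) homolog (CLOCK), mRNA"),
    (5453027, "AF071947.1", "MMTECH13",
     "Mus musculus protein tyrosine kinase Txk (Txk) gene, exon 1 and partial cds"),
]
cur.executemany("INSERT INTO Sequences VALUES (?, ?, ?, ?)", rows)

# DML: fill in the missing locus, then read it back.
cur.execute("UPDATE Sequences SET locus = 'CLOCK' WHERE gi = 14728383")
after_update = cur.execute(
    "SELECT locus FROM Sequences WHERE gi = 14728383").fetchone()

# DML: delete the row and count what is left.
cur.execute("DELETE FROM Sequences WHERE gi = 14728383")
remaining = cur.execute("SELECT COUNT(*) FROM Sequences").fetchone()[0]
conn.commit()

print(after_update, remaining)
```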


DML Queries

A basic SQL query:

SELECT gi, description FROM Sequences WHERE locus = 'MMTECH13';

The "WHERE" clause is optional:

SELECT gi, description FROM Sequences;

SELECT * FROM Sequences;

"NULL" values get special treatment:

SELECT * FROM Sequences WHERE locus IS NULL;

(not the same as "WHERE locus = NULL": in standard SQL a comparison with NULL is never true, although some systems treat the two forms alike)
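The special treatment of NULL is easy to observe directly. The sketch below uses Python's built-in `sqlite3` module as a stand-in DBMS (an assumption; other systems may behave differently, as the slide hints) with a cut-down two-column version of the Sequences table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequences (gi INTEGER, locus TEXT)")
conn.executemany(
    "INSERT INTO Sequences VALUES (?, ?)",
    [(14728383, None), (5453027, "MMTECH13")],
)

# IS NULL tests for a missing value.
is_null = conn.execute(
    "SELECT gi FROM Sequences WHERE locus IS NULL").fetchall()

# "= NULL" compares against an unknown value, so the test is never true here.
equals_null = conn.execute(
    "SELECT gi FROM Sequences WHERE locus = NULL").fetchall()

print(is_null)
print(equals_null)
```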


Joins of Two Or More Tables

Sequences
gi        accession    locus       description
14728383  XM_003323.4              Homo sapiens clock(mouse)...
5453027   AF071947.1   MMTECH13    Mus musculus protein tyrosine...
1161348   U34367.1     HSTECTXT01  Human protein tyrosine kinase...

Genes
name   species  sequence_gi  description
Clock  human    14728383     Circadian Locomotor Output Cycles Kaput
Txk    mouse    5453027      Protein tyrosine kinase TXK
Tec    human    1161348      Tec protein tyrosine kinase
Txk    human    1161348      Protein tyrosine kinase TXK

SELECT g.name, s.accession, s.locus
FROM Sequences s, Genes g
WHERE g.name = 'Txk'
  AND g.species = 'human'
  AND g.sequence_gi = s.gi;


Joins, Cont.

Query result
g.name  s.accession  s.locus
Txk     U34367.1     HSTECTXT01

• Note that:
    - The result of the query is itself a relational table
    - The columns used to perform the join (gi and sequence_gi) need not appear in the result table
    - Our table aliases appear in the column names

• This type of join is called an "equijoin"

• "Select", "Project", and "Join" are the basic operators of relational algebra
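The example join can be reproduced with Python's built-in `sqlite3` module. SQLite and the trimmed-down column lists are assumptions made for brevity; the gene is selected through the `name` column, as shown in the Genes table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sequences (gi INTEGER, accession TEXT, locus TEXT);
    CREATE TABLE Genes (name TEXT, species TEXT, sequence_gi INTEGER);

    INSERT INTO Sequences VALUES (14728383, 'XM_003323.4', NULL);
    INSERT INTO Sequences VALUES (5453027,  'AF071947.1',  'MMTECH13');
    INSERT INTO Sequences VALUES (1161348,  'U34367.1',    'HSTECTXT01');

    INSERT INTO Genes VALUES ('Clock', 'human', 14728383);
    INSERT INTO Genes VALUES ('Txk',   'mouse', 5453027);
    INSERT INTO Genes VALUES ('Tec',   'human', 1161348);
    INSERT INTO Genes VALUES ('Txk',   'human', 1161348);
""")

# Equijoin: rows are paired wherever Genes.sequence_gi equals Sequences.gi.
result = conn.execute("""
    SELECT g.name, s.accession, s.locus
    FROM Sequences s, Genes g
    WHERE g.name = 'Txk'
      AND g.species = 'human'
      AND g.sequence_gi = s.gi
""").fetchall()

print(result)
```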


How are Joins Performed?

"NESTED LOOP" ITERATION FOR OUR EXAMPLE JOIN (IGNORING THE RESTRICTIONS ON NAME AND SPECIES):

for each row G in table Genes {
    for each row S in table Sequences {
        if (G.sequence_gi = S.gi) then add G+S to the result
    }
}

MORE GENERALLY, G(row1,row2) IS ANY BOOLEAN FUNCTION:

for each row R in table T1 {
    for each row S in table T2 {
        if G(R,S) then add R+S to the result
    }
}
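The general nested-loop strategy translates almost line for line into Python. In this sketch tables are modeled as lists of dictionaries and the join condition is passed in as a function; those representation choices, and the name `nested_loop_join`, are assumptions made for illustration rather than how a real DBMS stores rows.

```python
def nested_loop_join(t1, t2, predicate):
    """Return merged rows for every pair (r, s) where predicate(r, s) is true."""
    result = []
    for r in t1:              # outer loop: every row of the first table
        for s in t2:          # inner loop: every row of the second table
            if predicate(r, s):
                result.append({**r, **s})
    return result


genes = [
    {"name": "Clock", "species": "human", "sequence_gi": 14728383},
    {"name": "Txk", "species": "mouse", "sequence_gi": 5453027},
    {"name": "Tec", "species": "human", "sequence_gi": 1161348},
    {"name": "Txk", "species": "human", "sequence_gi": 1161348},
]
sequences = [
    {"gi": 14728383, "accession": "XM_003323.4"},
    {"gi": 5453027, "accession": "AF071947.1"},
    {"gi": 1161348, "accession": "U34367.1"},
]

# The equijoin from the example: the predicate compares the two key columns.
joined = nested_loop_join(genes, sequences,
                          lambda g, s: g["sequence_gi"] == s["gi"])
print(len(joined))
```

The predicate is evaluated once for every pair of rows, so the work grows with the product of the two table sizes regardless of how few pairs match.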


Another Join Strategy for Equijoins

ASSUME WE HAVE AN INDEX ON Sequences.gi:

for each row G in table Genes {
    seqRows = gi_index_lookup(G.sequence_gi)
    for each row S in set seqRows {
        add G+S to the result
    }
}

OR IF WE HAD AN INDEX ON Genes.sequence_gi:

for each row S in table Sequences {
    geneRows = sequence_gi_index_lookup(S.gi)
    for each row G in set geneRows {
        add G+S to the result
    }
}
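The index-based strategy can be sketched in Python by letting a dictionary play the role of the index. The helper names (`build_index`, `index_join`) and the list-of-dictionaries table representation are hypothetical; a real DBMS would typically use an on-disk B-tree or hash index instead.

```python
from collections import defaultdict


def build_index(rows, key):
    """Map each value of `key` to the list of rows carrying it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def index_join(outer, inner_index, outer_key):
    """For each outer row, fetch the matching inner rows by index lookup."""
    result = []
    for o in outer:
        for i in inner_index.get(o[outer_key], []):
            result.append({**o, **i})
    return result


genes = [
    {"name": "Tec", "sequence_gi": 1161348},
    {"name": "Txk", "sequence_gi": 1161348},
    {"name": "Clock", "sequence_gi": 14728383},
]
sequences = [
    {"gi": 14728383, "accession": "XM_003323.4"},
    {"gi": 1161348, "accession": "U34367.1"},
]

gi_index = build_index(sequences, "gi")            # index on Sequences.gi
joined = index_join(genes, gi_index, "sequence_gi")
print([(row["name"], row["accession"]) for row in joined])
```

Each Genes row now costs one lookup instead of a full pass over Sequences, which is the saving the slide describes.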


Query Optimization

• Choosing a specific query plan for a query
    - Query plans are assembled by choosing from among predefined strategies like those we just saw

• Much database research is devoted to this topic
    - Query plans are usually not optimal
    - Can be a highly computationally intensive process, particularly when many tables are involved


What Must the Optimizer Decide?

• Order in which to perform operations
    - Particularly the order in which to perform joins
    - Accounts for much of the complexity

• Join strategy and/or index selection
    - It is not always faster to use an index

• Other considerations
    - Optimize for response time or overall time?
    - Use additional resources to speed up queries?


Cost-based Optimization versus Rule-based

• Cost-based
    - Enumerates a set of plans and then chooses the "best"
    - Best = lowest predicted "cost"
    - Cost usually defined in terms of disk accesses
    - Relies on table statistics / selectivity estimates

• Rule-based
    - Applies a set of predefined rules to transform a single initial plan until no more rules apply
    - e.g., always use the most specific index possible
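A toy version of cost-based selection is sketched below. Everything in it is a simplifying assumption: the cost formulas count row visits rather than disk accesses, the index is assumed to behave like a balanced tree, and the plan names and the `choose_plan` function are hypothetical. Real optimizers use far richer statistics.

```python
import math


def nested_loop_cost(outer_rows, inner_rows):
    """Every outer row is compared against every inner row."""
    return outer_rows * inner_rows


def index_join_cost(outer_rows, inner_rows):
    """Each outer row costs one tree-index probe, about log2(inner_rows) steps."""
    return outer_rows * max(1, math.ceil(math.log2(inner_rows)))


def choose_plan(table_sizes):
    """Enumerate candidate plans and return the (name, cost) with lowest cost."""
    genes, sequences = table_sizes["Genes"], table_sizes["Sequences"]
    candidates = {
        "nested loop (Genes outer)":
            nested_loop_cost(genes, sequences),
        "index on Sequences.gi (Genes outer)":
            index_join_cost(genes, sequences),
        "index on Genes.sequence_gi (Sequences outer)":
            index_join_cost(sequences, genes),
    }
    return min(candidates.items(), key=lambda item: item[1])


# Table statistics (row counts) drive the decision.
best = choose_plan({"Genes": 1_000, "Sequences": 1_000_000})
print(best)
```

With these made-up row counts the plan that probes the index on the larger table wins, because it touches the million-row table only through lookups.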


RDBMS Advantages Revisited

• Permit arbitrary queries on structured data

• Enforce constraints on the data

• Maintain data integrity in the face of concurrent updates - typically with transactions

• Support different logical views on the data


An Introduction to Constraints

• The following is also a valid FASTA file:

>yada
GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTAAGATGCCTTTGGATAATTCTACAGTCCTCTTAAATGAATCTTTAGAACTTGGCAAGTCTCACTAGATACCTTCAATCATCATTTTGAGCTCAAAGAATTCTGAGACTTATGGTTGGTCATATAGAAGAGTACCTTGAACCTATAGTTTCC
>yada
TTCTCGTCCCGTCTTCGTCTCTCGTCTCGTCGTCGTCGTCGTCTCTCTGCCTCTCCGTCTCGTCGGATCTCGTCGTCTCTCGTCTCGTCTTTCTCTCGTCTGTCAAGGTTTCTCTATGTAACCCTGGCTGTCCTGGAACTCTATGTAGACCAGGCTGGCCTTGAACTCACTCGCTTCTGCCTCCCAGAGTGCTGGGATTAAAGACTTTTGCCACCACACTTAGCTTCAGCCAGTTTTTAAAACAGATCTTTAGATTCTCCCCTAATCCCAAACCAAGTCT
>
TCTAGATGCTTTATACCACTTCCTCTGGAGCAGCTCTGCTTTAGATTTTTTACATATNGGGCCTTTGGGTAAAATTTTATTTAAAGAACAAAATTGCTTACAAAGAAAAGCTTGCAAACTATTTTATGATACCTTGTCTCTACTGCCTTATAAAAAGAAAGAAAAACAGAAAAATGAAAGTGCAGATGGTAACGTGGTGATGGATCCATCTCATGCACTTAAATTAAGCTGAGAGGAAGTTGGAAATCACCAGGTGTGGCCACTGAGGTGAACATTTAGC

    - Duplicate and missing sequence identifiers
    - No restrictions on duplicated sequences
    - Some descriptions do require a "sequence name"


BLASTN 2.0MP-WashU [16-Dec-1999] [linux-x86 23:30:47 16-Dec-1999]

Copyright (C) 1996-1999 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-1999) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query= Query sequence (210 letters)

Database: test.fsa
          3 sequences; 770 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                     Smallest
                                                       Sum
                                              High  Probability
Sequences producing High-scoring Segment Pairs:  Score  P(N)      N

yada                                              1050  5.5e-47   1

>yada
        Length = 210

  Plus Strand HSPs:

 Score = 1050 (163.6 bits), Expect = 5.5e-47, P = 5.5e-47
 Identities = 210/210 (100%), Positives = 210/210 (100%), Strand = Plus / Plus

Query:     1 GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTA 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct:     1 GAATTTTTACTTGTTCCTGCAAAGCTGCTGGAGCTCAGAAGCTGATTCTATCACATTGTA 60

.
.
.


Constraints, Cont.

• BLAST is perfectly happy with this database

• But ideally we would like to ensure:
    - Every sequence has some kind of identifier
        • so something shows up in our output
    - Each identifier is unique, at least within a database
        • so we know what we've hit

• In a relational database we would use:
    - "NOT NULL" constraint
    - "UNIQUE" constraint


Constraints in SQL

CREATE TABLE Sequences (
    gi          int(10)      NOT NULL,
    accession   varchar(15)  NOT NULL,
    locus       varchar(20),
    description varchar(200),
    PRIMARY KEY(gi),
    UNIQUE(accession),
    INDEX(accession)
);
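To see the constraints reject bad rows, the table can be created in SQLite through Python's built-in `sqlite3` module. Two assumptions are worth flagging: SQLite is used only as a convenient stand-in, and the `INDEX(accession)` clause is dropped because SQLite does not accept that syntax inside CREATE TABLE (its UNIQUE constraint already creates an index).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sequences (
        gi          INTEGER NOT NULL,
        accession   TEXT    NOT NULL,
        locus       TEXT,
        description TEXT,
        PRIMARY KEY (gi),
        UNIQUE (accession)
    )
""")
conn.execute(
    "INSERT INTO Sequences VALUES (1161348, 'U34367.1', 'HSTECTXT01', NULL)")

rejected = []
bad_rows = [
    (5453027, None, None, None),           # violates NOT NULL on accession
    (5453027, "U34367.1", None, None),     # violates UNIQUE on accession
    (1161348, "AF071947.1", None, None),   # violates PRIMARY KEY on gi
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO Sequences VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        rejected.append(str(err))          # the DBMS refused the row

count = conn.execute("SELECT COUNT(*) FROM Sequences").fetchone()[0]
print(len(rejected), count)
```

None of the three bad rows reaches the table, which is exactly the guarantee the FASTA file could not give.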


Transactions: "ACID" Properties

• Atomicity
    - All operations in a transaction fail or all succeed (even in the event of a system crash)
    - There are no "partially processed" transactions

• Consistency
    - If the database is in a consistent state before the transaction starts, then it must be in a consistent state when it ends (e.g., with respect to its constraints.)


Transactions, Cont.

• Isolation
    - Actions carried out in one transaction should not be affected by those carried out in another (unless the other completes before it starts.)

• Durability
    - Once the system reports a transaction has succeeded its effects must persist in the database
    - Even in the event of one or more system crashes
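Atomicity is the easiest of the four properties to demonstrate in a few lines. The sketch below uses Python's built-in `sqlite3` module (an assumption; any transactional DBMS would do) and a deliberately failing second statement; the table contents are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequences (gi INTEGER PRIMARY KEY, locus TEXT)")
conn.execute("INSERT INTO Sequences VALUES (1, 'A')")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE Sequences SET locus = 'B' WHERE gi = 1")
        # Duplicate primary key: this statement fails and aborts the block.
        conn.execute("INSERT INTO Sequences VALUES (1, 'duplicate')")
except sqlite3.IntegrityError:
    pass

# The UPDATE was rolled back together with the failed INSERT.
locus = conn.execute(
    "SELECT locus FROM Sequences WHERE gi = 1").fetchone()[0]
print(locus)
```

Because both statements belong to one transaction, the successful UPDATE is undone when the INSERT fails, leaving no "partially processed" result.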


Implementing Transaction Isolation

• Typically entails the use of "locks"

• Locks come in different flavors, for example:
    - read-only (or "shared")
    - exclusive

• 2-phase locking is one relatively simple scheme to guarantee serializability

• Locking can unfortunately result in "deadlock"


Ensuring Atomicity and Durability

• Write-Ahead Logging
    - Two distinct storage areas in most databases
    - Write to the log before updating the actual tables
    - And before saying that a commit has succeeded
    - Then transactions can be rolled back or forward

• "Well, I hope you've got the WAL on a RAID-1 volume. Keep up the good work."


A Word On Views

• A view is a table constructed by a query

• May or may not be "materialized"

• May or may not be updateable

• Can be used to implement access control

CREATE VIEW HumanGenes
AS SELECT * FROM Genes WHERE species = 'human';

GRANT SELECT ON HumanGenes TO bob;
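The view itself can be tried out with Python's built-in `sqlite3` module. SQLite is an assumption made for convenience, and because it has no GRANT statement the access-control half of the example is not reproduced; the sketch only shows that an ordinary (non-materialized) view is re-evaluated against the base table each time it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Genes (name TEXT, species TEXT, sequence_gi INTEGER);
    INSERT INTO Genes VALUES ('Clock', 'human', 14728383);
    INSERT INTO Genes VALUES ('Txk',   'mouse', 5453027);
    INSERT INTO Genes VALUES ('Tec',   'human', 1161348);

    CREATE VIEW HumanGenes AS
        SELECT * FROM Genes WHERE species = 'human';
""")

before = conn.execute(
    "SELECT name FROM HumanGenes ORDER BY name").fetchall()

# The view is not materialized: new base-table rows show up immediately.
conn.execute("INSERT INTO Genes VALUES ('Txk', 'human', 1161348)")
after = conn.execute(
    "SELECT name FROM HumanGenes ORDER BY name").fetchall()

print(before, after)
```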


Commercial Database Systems

• Sybase and Oracle
    - Arch-enemies, with Oracle the current leader
    - Both now include object-relational features and are marketed as so-called "universal" databases

• IBM's DB2
    - IBM has a long history of RDBMS development, beginning with "System R" in the 1970s

• O2 ("O deux"), Poet, Versant, etc.


Open Source Database Systems

• MySQL http://www.mysql.com
    - Very popular; reported to be fast for queries
    - Lacks a few common SQL features, some by design
        • Subqueries
        • Transactions (until recently)
        • Foreign key constraints
        • Stored procedures & triggers
        • Views


Open Source Database Systems

• PostgreSQL (post-gres-Q-L) http://www.us.postgresql.org/index.html
    - Object-relational DBMS
    - Based on the POSTGRES research prototype from Stonebraker's lab at UC Berkeley
    - Currently enjoying a resurgence in popularity
    - Similar ideas were commercialized by Stonebraker in Illustra, which was acquired by Informix circa 1995


References

• FASTA format description http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

• EMBL format http://www.ebi.ac.uk/embl/Documentation/User_manual/format.html

• ASN.1 (Abstract Syntax Notation One) http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html

• XML (Extensible Markup Language) http://www.w3.org/XML/


References, Cont.

• "Open Source" software http://www.opensource.org

• Database management system textbooks
    - Database Management Systems. Raghu Ramakrishnan. McGraw-Hill, 1998. ISBN 0-07-050775-9.
    - Principles of Database and Knowledge-Base Systems, Volume I: Classical Database Systems. Jeffrey D. Ullman. Computer Science Press, 1988. ISBN 0-7167-8158-1.


Part II: Domain-Specific Considerations

OR:

"What's so special about bioinformatics anyway?"


Bioinformatics / Genomics

• Tremendous (exponential) influx of data

• This by itself is not unique

• Novel characteristics:
    - Rapidly evolving understanding/schemata
    - Complex links in and between databases
    - Heterogeneity of databases & resources


Advent of Genomics

• Transformation of biology:
    - Traditional gene-by-gene approach, applying a series of hypothesis-driven experiments.
    - Versus highly parallel approach, applying the same experiment to many targets.

• Analogous change in information systems:
    - Point-and-click hyperlink chasing.
    - Versus "bulk" queries or analysis requiring integration.

• Has led to more widespread RDBMS adoption


Flashback to 1998

[Figure: data volume by year, 1990 to 2005, divided into "Genome I: Sequence genomics" and "Genome II: Functional genomics". Data types labeled: Mapping, Genomic Sequence, EST Sequence, Gene Expression, Proteomics, Polymorphisms, 3D Protein Structures. Organisms labeled: Yeast/Bacteria, C. elegans, Arabidopsis, Human.]


Where to Put the Sequence?

• In files or in the database?
    - Depends on the filesystem
    - All the usual advantages to using the database

• If in the database, what datatype to use?
    - TEXT or CLOB
    - Hybrid approach (varchar + CLOB)

• Should it be compressed?
    - Only a few bits per residue are needed: 2 bits distinguish A, C, G and T, and about 5 bits cover the 20 standard amino acids
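As a rough illustration of the compression question, the sketch below packs an unambiguous DNA string at 2 bits per base, four bases per byte. The encoding table and the `pack`/`unpack` names are assumptions made for this example; it deliberately ignores ambiguity codes such as the N that appears in one of the example sequences, which a real scheme would have to handle separately.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GAATTTTTACTTGTTCCTGCAA"
packed = pack(seq)
print(len(seq), len(packed))   # characters as text vs. bytes when packed
```

The original length has to be stored alongside the packed bytes, since the last byte may be only partly used.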


Support For Schema Evolution

• Schema evolution
    - We are still learning all the rules of biology
        • e.g., discovery of introns
    - Experimental techniques are constantly improving
        • e.g., microarray-based assays for RNA abundance

• In most relational databases:
    - Fairly easy to add columns and tables
    - Not so easy to restructure existing ones


Schema Evolution, Cont.

• Problem can be mitigated somewhat by:
    - Higher-level abstractions & data models
    - Judicious use of views

• Actually an argument for less structure:
    - FASTA
    - GFF (Gene Finding Format)
    - DAS (Distributed Annotation System)

• Tension between extensibility and queryability


Database Heterogeneity

• Many databases, both large and small scale

• Syntactic heterogeneity
    - Different storage formats and databases
    - Different access mechanisms / query languages

• Fundamental semantic differences
    - What is a gene?
        • A locus associated with a particular disease?
        • A specific region of an organism's DNA?
        • What about chromosomal rearrangements?


Database Integration Systems

• Essentially allow the definition of views across multiple heterogeneous databases

• Provide a virtual global schema against which queries may be posed
    - Data model and query language must be sufficiently sophisticated to handle all the source databases

• Resolve syntactic and semantic heterogeneity
    - The latter is typically the difficult problem

• Usually mediator or wrapper-based


Architecture of an Example System


Combining Specialized Tools with General-Purpose Databases

• e.g. Run a BLAST search, then run an SQL query on each sequence hit with p-value < 10E-30

• Two most common solutions:
    - Add specialized searches as operations to existing database systems
        • Many allow users to define their own ADTs
        • Question of how they interact with query optimizer
    - Add these search tools as additional "data sources" to a database integration system


References - Database Integration

• SRS (Sequence Retrieval System)

http://srs.ebi.ac.uk

System of indexed flat files that supports some RDBMS-like join and selection operations

Can only join on columns preselected by the administrator

• IBM's DiscoveryLink

http://www.ibm.com/discoverylink

Commercial offshoot of the GARLIC research project

DB2-based, relies heavily on cost-based optimization

• Multidatabase OPM

http://gizmo.lbl.gov/DM_TOOLS/DMTools.html

Commercial/now part of GeneLogic's internal development only


References, Cont.

• TAMBIS @ University of Manchester

http://img.cs.man.ac.uk/tambis/

Description logic based; permits reasoning about metadata/schema

• Kleisli / UPenn. & KRDL, Singapore

http://www.geneticXchange.com/

Now a commercial product of "geneticXchange, Inc."

• K2 / UPenn. Center for Bioinformatics

http://www.cbil.upenn.edu/K2

http://www.cis.upenn.edu/~sharker/K2_site/

Follow-on project to Kleisli; implemented in Java, not SML


References, Cont.

• Many others not marketed/targeted specifically for bioinformatics

e.g., TSIMMIS, Tukwila, Ariadne, InfoMaster

see http://www.cbil.upenn.edu/java/k2/servlet?page=links for some links

• Additional data formats and standards mentioned

DAS (Distributed Annotation System)

• http://das.wustl.edu

GFF (Gene-Finding Format)

• http://www.sanger.ac.uk/Software/formats/GFF/
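As a rough illustration of what GFF data looks like, the sketch below parses one tab-separated GFF line into its standard columns (seqname, source, feature, start, end, score, strand, frame, and an optional group field). The record type `GffRecord`, the function `parse_gff_line`, and the example line are hypothetical and not taken from any real annotation file.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the file gives "."
    strand: str
    frame: str
    group: str              # optional ninth column, "" if absent


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, group,
    )


# Made-up example line, for illustration only.
example = "chr22\tGenscan\texon\t1000\t1200\t0.95\t+\t0\tgene_1"
rec = parse_gff_line(example)
print(rec.feature, rec.end - rec.start + 1)
```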

Part III: Bioinformatics Database Resources on the World Wide Web

OR:

"We can build a database and put it on the network. We like databases."

Classifying Database Resources

• Organism-specific

• Datatype-specific

Mapping (e.g. fingerprint, STS-content)

Sequence (e.g. DNA, mRNA, protein)

Phenotype (e.g. mutations, expression)

• Archival vs. curated, primary vs. secondary

• Data submission method (if any)

Classifying Databases, Cont.

• Who owns (or sells) the content?

Increasing commercialization of databases

At least two different varieties:

• Core data itself is proprietary (e.g. Incyte, Celera)

• Value-added model (e.g. SWISS-PROT, DoubleTwist)

• What access is provided?

Browsing only

Ad-hoc query facility (e.g. boolean queries)

Unrestricted SQL queries against underlying DBMS

Databases - Sequence

• DNA (i.e., genomic sequence)

International Collaboration of sequence DBs:

• GenBank (U.S.A. / NCBI) http://www.ncbi.nlm.nih.gov/

• EMBL (Europe / EBI) http://www.ebi.ac.uk/embl/

• DDBJ (Japan / MEXT) http://www.ddbj.nig.ac.jp/

Have evolved to accommodate high-throughput and shotgun sequencing:

• "Trace archives" for mouse and other organisms

• http://www.ncbi.nlm.nih.gov/Traces/

• ESTs - Expressed Sequence Tags

dbEST

• http://www.ncbi.nlm.nih.gov/dbEST

• GenBank subset with additional EST-specific data

• Implemented in a Sybase relational database

• SNPs - Single Nucleotide Polymorphisms

dbSNP

• http://www.ncbi.nlm.nih.gov/SNP/

• Very similar to dbEST in philosophy and implementation

• Many commercial databases

Celera, Incyte, etc.

• Proteins (because most genes code for them)

SWISS-PROT and TrEMBL

• http://www.expasy.ch/sprot/sprot-top.html

• Manually and automatically annotated protein sequences

• Became a commercial database circa 1998

• Data modeling decision: one entry may represent the same protein from different organisms if the sequences are sufficiently similar. What is a protein?

• Protein motifs (because proteins are modular)

PRODOM

• http://prodes.toulouse.inra.fr/prodom/doc/prodom.html

• Consensus motifs from multiple sequence alignments

• Some of the newer families were built automatically from entries in Pfam-A; illustrates interdependency of databases, potential for error propagation.

PROSITE

• http://www.expasy.ch/prosite

• Manually curated, much like SWISS-PROT with regular expression style motifs, e.g. W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-P

• Like SWISS-PROT, also a commercial venture
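To make the "regular expression style" of PROSITE motifs concrete, here is a minimal sketch that translates a pattern such as the one above into a Python regular expression. The function name `prosite_to_regex` and the test sequence are hypothetical; the translation covers only the common pattern elements (x, [..], {..}, (n,m), < and >) and is not a full implementation of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # x: any residue
        element = element.replace("{", "[^").replace("}", "]")    # {..}: excluded residues
        element = element.replace("(", "{").replace(")", "}")     # (n) or (n,m): repeat count
        element = element.replace("<", "^").replace(">", "$")     # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-P")
print(regex)

# Made-up sequence constructed to satisfy the motif, for illustration only.
sequence = "MK" + "W" + "A" * 9 + "V" + "F" + "A" * 6 + "G" + "P" + "LL"
match = re.search(regex, sequence)
print(match.start() if match else None)
```

Because the motif becomes an ordinary regular expression, scanning a set of sequences for it needs nothing more than a standard regex library.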

PRINTS

• http://bioinf.man.ac.uk/dbbrowser/PRINTS

• Database of protein "fingerprints"

• Now has a relational version, "PRINTS-S"

BLOCKS

• http://blocks.fhcrc.org/blocks/

• Automatically determined ungapped conserved segments

Pfam (WashU/Sanger)

• http://pfam.wustl.edu/

• Protein families represented by profile HMMs generated from multiple sequence alignments.

CDD (Conserved Domain Database)

• http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

• Based on SMART and Pfam, with some additions

INTERPRO

• http://www.ebi.ac.uk/interpro/

• Includes Pfam, PRINTS, PROSITE, ProDom, SMART

• Mapping from Interpro domains to GO

3-Dimensional Protein Structures

• PDB (Protein Data Bank)

http://www.rcsb.org/pdb

Who can argue with the database that brings you the "molecule of the month"?

• MMDB (Molecular Modeling DataBase)

http://www.ncbi.nlm.nih.gov/Structure/

Experimentally-determined 3D structures obtained from PDB; another case of derived data (data provenance, anyone?)

Databases - Expression

• GXD (mouse Gene Expression Database)

http://www.informatics.jax.org/mgihome/GXD/gxdgen.shtml

In-situ expression studies, Northern & Western blots, and RT-PCR experiments, in addition to the (now) typical high-density microarray data.

Linked to a database of anatomy and a 3D atlas; Sybase DB.

• Newer databases focus on array-based data

MGED standards effort

http://www.mged.org

Genome/Integrated

• Golden Path (UCSC)

http://genome.ucsc.edu

Draft public human genome assembly produced by Jim Kent in the Haussler lab. Includes many additional "tracks" or datasets provided by other sites (e.g., gene predictions, markers, etc.)

• Ensembl (EBI & Sanger Centre)

http://www.ensembl.org

Uses the Golden Path assembly as the basis for its automated analyses.

Integrated Databases/Systems

• Entrez (NCBI)

http://www.ncbi.nlm.nih.gov/Entrez/

Integrated boolean query interface

Includes MMDB, PubMed, GenBank, GenPept, etc.

• SRS - Sequence Retrieval System

http://srs.ebi.ac.uk

Mentioned earlier under database integration

"Model Organism" Databases

• Mouse (Mus musculus)

• Fly (Drosophila melanogaster)

• Arabidopsis (Arabidopsis thaliana)

• Worm (Caenorhabditis elegans)

• E. coli (Escherichia coli)

• Yeast (Saccharomyces cerevisiae)

• Zebrafish (Danio rerio)

Genome/Mapping & Phenotype

• GDB (Genome DataBase)

http://gdbwww.gdb.org

Human mapping database; focus on genes and loci

Implemented in OPM over Sybase (now Oracle 8i?)

• MGD (Mouse Genome Database)

http://www.informatics.jax.org/

Mouse equivalent of GDB; implemented in Sybase

Integrated with sister databases GXD and MGS

• TAIR (The Arabidopsis Information Resource)

http://www.arabidopsis.org/

Based on Sybase database from NCGR

• AceDB (A C. elegans Database)

http://www.acedb.org/

Example of a DBMS developed specifically for a particular biological application (and organism!)

Very popular due to graphical interface

But lacks many basic RDBMS features

• WormBase

http://www.wormbase.org/

Replacement for AceDB; initially a retrofitting of new user interfaces to the existing AceDB engine

• YPD (Yeast Proteome Database)

http://www.proteome.com/DB-demo/intro-to-YPD.html

Another merger victim; Proteome, Inc. & Incyte

• SGD (Saccharomyces Genome Database)

http://genome-www.stanford.edu/Saccharomyces/

Databases - Others

• MEDLINE/PubMed - literature

http://www4.ncbi.nlm.nih.gov/PubMed

Integrated into NCBI's "Entrez" system.

• OMIM - Online Mendelian Inheritance in Man

http://www.ncbi.nlm.nih.gov

A database that started out as a book (MIM).

Provides a view of the world based on human diseases with a genetic component.

• TRANSFAC

http://transfac.gbf.de

See also http://www.cbil.upenn.edu/tess/

Transcription factors and their genes; transcription factor binding site models.

• GeneCards

http://bioinformatics.weizmann.ac.il/cards/

Built through the magic of Perl. Attempts to automatically summarize all available information and database links for known genes.

The Database Databases

• DBCAT (Infobiogen)

http://www.infobiogen.fr/services/dbcat/

Current total: 511 databases (as of 10/18/2001)

Self-referential; contains itself as an entry

• The Biocatalog (EBI)

http://www.ebi.ac.uk/biocat/

Does not appear to list itself...

Last updated July 25, 2000