Introduction to Databases
Shifra Ben-Dor
Irit Orr
Lecture Outline
• Introduction
  – Data and Database types
  – Database components
• Data Formats
• Sample databases
• How to text search databases
What “units of information” do we deal with in bioinformatics?
• DNA
• RNA
• Protein
• Sequence
• Structure
• Evolution
• Pathways
• Interactions
• Mutations
Nucleotide sequence: AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG
Genes
mRNA
Protein primary sequence
Protein 3D structure
Protein Function
Acts as a tumor suppressor in many tumor types. Induces growth arrest or apoptosis depending on the physiological circumstances or cell type, but both activities are involved in tumor suppression.
Involved in the transport of chloride ions. Defects in CFTR are the cause of cystic fibrosis. It is the most common genetic disease in the Caucasian population, with a prevalence of about 1 in 2000 live births. CF, an autosomal recessive disorder, is a common generalized disorder of exocrine gland function.
SNPs
• What do we want from databases?
All of these have databases and tools that were created to work with them
Information retrieval from sequence databases
Biological databases contain enormous amounts of data.
• Databases need to be well annotated.
• Databases need to be easily searched.
• Data found in databases should be easily retrieved.
• Data in databases should be in standard formats.
Integrated Information Retrieval
• Many databases contain logical relations between specific entries.
• One interface can connect many biological databases.
• For example: a database that links protein sequence, protein domain, protein structure and reference databases (InterPro).
• Another example: links between references, protein sequence, DNA sequence, and structure databases (Entrez).
Slide provided by Dr. Vered Caspi
Core Data and Annotation
Databases generally have (at least) two types of data:
Core data: The data the database was generated to organize
Annotation: Extra information that rounds out our picture of the core data
For example, in a genome database, the sequence is the core data, and the location of genes is the annotation.
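The core/annotation split can be pictured as a small data structure. The following is only a sketch: the class names `GenomeEntry` and `Annotation`, the accession `DEMO0001`, and the sequence are made up for illustration and do not come from any real database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Extra information attached to a region of the core sequence."""
    kind: str    # e.g. "gene" or "CDS"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    note: str = ""


@dataclass
class GenomeEntry:
    """Hypothetical record: the sequence is core data, features are annotation."""
    accession: str
    sequence: str
    annotations: list = field(default_factory=list)

    def annotated_sequence(self, ann: Annotation) -> str:
        """Return the stretch of core sequence that an annotation points at."""
        return self.sequence[ann.start - 1:ann.end]


# Made-up demo data, not a real database entry.
entry = GenomeEntry(accession="DEMO0001", sequence="ATGGACGTGACCATCCAGTAA")
entry.annotations.append(Annotation("CDS", 1, 21, note="made-up example feature"))
print(entry.annotated_sequence(entry.annotations[0]))
```

Keeping the annotation as coordinates into the sequence, rather than as copies of it, mirrors how the feature tables in the GenBank and EMBL examples refer to the sequence by position.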
Database Issues
• Printed journals vs. databases
• Direct submission to databases (e.g. GenBank, GDB, PDB)
• Archival vs. curated databases
• Databases that publish experimental results of large genomic centers.
• Public vs. private databases.
For Example: Classification of Genomic Databases
Database scope
• Many genomes
• One Genome
• One Subject
• One Gene
Information source
• Direct submission from scientific community
• Scientific literature
• Genome center’s experimental results
• Other databases
Information type
• Mapping
• Sequence & annotation
• Protein structure & function
• Variations
• Comparative genomics
• Gene networks
Slide provided by Dr. Vered Caspi
User Interface
• Database search
  – free text
  – field-specific
  – sequence-based
• Database output
  – text
  – graphics
  – dynamic
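The difference between free-text and field-specific search can be sketched over a toy in-memory record list. This is an illustration only: the function names and the dict layout are assumptions, and real database interfaces index their fields rather than scanning them. The field values are taken from the GenBank and EMBL examples in this lecture.

```python
# Hypothetical in-memory "database": each record is a dict of named fields.
RECORDS = [
    {"id": "NM_000394", "organism": "Homo sapiens",
     "definition": "Homo sapiens crystallin, alpha A (CRYAA), mRNA."},
    {"id": "AJ279484", "organism": "ascomycota sp. 4/97-9",
     "definition": "Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2"},
]


def free_text_search(records, term):
    """Match the term against every field of every record (case-insensitive)."""
    term = term.lower()
    return [r["id"] for r in records
            if any(term in str(value).lower() for value in r.values())]


def field_search(records, field_name, term):
    """Match the term against one named field only."""
    term = term.lower()
    return [r["id"] for r in records
            if term in str(r.get(field_name, "")).lower()]


print(free_text_search(RECORDS, "crystallin"))
print(field_search(RECORDS, "organism", "crystallin"))
```

The same term finds a record in free-text mode but nothing when restricted to the organism field, which is the practical reason field-specific search gives more precise results.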
Data Formats
There are many data formats used for sequences (both nucleic and amino acid)
• FASTA Format
• GenBank Format
• EMBL Format
• GCG Format
FASTA Format
• Simplest format
• Least information
• Starts with a > and sequence name on one line
• The sequence in plain text follows
>OB2T2
GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCAAAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGATGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCCCTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTGCCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCSRCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKRKTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSSPSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLLATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGDVSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGNIYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGATPSNRGPRNQFITHD
>TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor
MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCSRCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDRKAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNTSSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVLACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGPPTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPVLGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL
>TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60)
MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCTKCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKADMDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECTPCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWRPRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSSTPISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDTADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPRHEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR
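Because a FASTA record is just a ">" header line followed by sequence lines, it can be read with a few lines of code. The sketch below assumes the common convention that the first word of the header is the sequence name and the rest is a free-text description. The function names and the demo records are made up for illustration.

```python
def parse_fasta(text):
    """Yield (name, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    """Split the header into name and description; join the sequence lines."""
    name, _, description = header.partition(" ")
    return name, description, "".join(chunks)


# Made-up demo input, not real database records.
EXAMPLE = """>seq1 made-up demo record
ATGGAC
GTGACC
>seq2
MLLPWA
"""
records = list(parse_fasta(EXAMPLE))
for name, description, sequence in records:
    print(name, len(sequence))
```

A multi-record file such as the three receptor sequences above would yield one tuple per ">" line.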
GenBank sequence format
NM_000394. Homo sapiens crys...[gi:14043059]
LOCUS       NM_000394    1114 bp    mRNA            PRI       15-MAY-2001
DEFINITION  Homo sapiens crystallin, alpha A (CRYAA), mRNA.
ACCESSION   NM_000394
VERSION     NM_000394.2  GI:14043059
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J. and Piatigorsky,J.
  TITLE     A pseudo-exon in the functional human alpha A-crystallin gene
  JOURNAL   Nature 337 (6209), 752-754 (1989)
  MEDLINE   89143747
   PUBMED   2918909
REFERENCE   2  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J.
  TITLE     A reassessment of mammalian alpha A-crystallin sequences using
            DNA sequencing: implications for anthropoid affinities of tarsier
FEATURES             Location/Qualifiers
     source          1..1114
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /map="21q22.3"
     gene            1..1114
                     /gene="CRYAA"
                     /note="CRYA1"
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
     misc_feature    70..234
                     /note="crystallin; Region: Alphacrystallin A chain"
     CDS             70..591
                     /gene="CRYAA"
                     /note="human alphaA-crystallin;crystallin, alpha-1"
                     /codon_start=1
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
                     /product="crystallin, alpha A"
                     /protein_id="NP_000385.1"
                     /db_xref="GI:4503055"
                     /translation="MDVTIQHPWFKRTLGPFYPSRLFDQFFGEGLFEYDLLPFL
                     SSTISPYYRQSLFRTVLDSGISEVRSDRDKFVIFLDVKHFSP
                     EDLTVKVQDDFVEIHGKHNERQDDHGYISREFHRRYRLPS
                     NVDQSALSCSLSADGMLTFCGPKIQTGLDATHAERAIPVSR
                     EEKPTSAPSS"
     misc_feature    244..555
                     /note="HSP20; Region: Hsp20/alpha crystallin family"
     polyA_signal    1092..1097
BASE COUNT      183 a    400 c    309 g    222 t
ORIGIN
        1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca
       61 aagctgaaca tggacgtgac catccagcac ccctggttca agcgcaccct ggggcccttc
      121 taccccagcc ggctgttcga ccagtttttc ggcgagggcc tttttgagta tgacctgctg
      181 cccttcctgt cgtccaccat cagcccctac taccgccagt ccctcttccg caccgtgctg
      241 gactccggca tctctgaggt tcgatccgac cgggacaagt tcgtcatctt cctcgatgtg
      301 aagcacttct ccccggagga cctcaccgtg aaggtgcagg acgactttgt ggagatccac
      361 ggaaagcaca acgagcgcca ggacgaccac ggctacattt cccgtgagtt ccaccgccgc
      421 taccgcctgc cgtccaacgt ggaccagtcg gccctctctt gctccctgtc tgccgatggc
      481 atgctgacct tctgtggccc caagatccag actggcctgg atgccaccca cgccgagcga
      541 gccatccccg tgtcgcggga ggagaagccc acctcggctc cctcgtccta agcaggcatt
      601 gcctcggctg gctcccctgc agccctggcc catcatgggg ggagcaccct gagggcgggg
      661 tgtctgtctt cctttgcttc ccttttttcc tttccacctt ctcacatgga atgagggttt
      721 gagagagcag ccaggagagc ttagggtctc agggtgtccc agaccccgac accggccagt
      781 ggcggaagtg accgcacctc acactccttt agatagcagc ctggctcccc tggggtgcag
      841 gcgcctcaac tctgctgagg gtccagaagg agggggtgac ctccggccag gtgcctcctg
      901 acacacctgc agcctccctc cgcggcgggc cctgcccaca cctcctgggg cgcgtgaggc
      961 ccgtggggcc ggggcttctg tgcacctggg ctctcgcggc ctcttctctc agaccgtctt
     1021 cctccaaccc ctctatgtag tgccgctctt ggggacatgg gtcgcccatg agagcgcagc
     1081 ccgcggcaat caataaacag caggtgatac aagc
//
Revised: October 24, 2001.
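In the record above, the four BASE COUNT values (183 + 400 + 309 + 222) add up to the 1114 bp given on the LOCUS line, and the sequence itself sits between ORIGIN and the // terminator. A minimal sketch of pulling the sequence out and recounting the bases follows. The function names and the short `DEMO` record are made up for illustration, and the code assumes the record is laid out one field per line.

```python
from collections import Counter


def origin_sequence(record_text):
    """Return the sequence between the ORIGIN line and the '//' terminator."""
    bases, in_origin = [], False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases)


def base_count(sequence):
    """Count each base, comparable to the BASE COUNT line of the record."""
    return dict(Counter(sequence.lower()))


# Made-up miniature record; only the layout follows the GenBank flat file.
DEMO = """LOCUS       DEMO0001   24 bp   mRNA
ORIGIN
        1 acactgcgct gcccagaggc
       21 cccg
//
"""
seq = origin_sequence(DEMO)
print(len(seq), base_count(seq))
```

Comparing the recounted totals with the BASE COUNT and LOCUS lines is a cheap consistency check after copying or reformatting a record.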
EMBL sequence format
ID   AJ279484   standard; DNA; FUN; 581 BP.
XX
AC AJ279484;
XX
SV AJ279484.1
XX
DT 14-JAN-2000 (Rel. 62, Created)
DT 14-JAN-2000 (Rel. 62, Last updated, Version 2)
XX
DE Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2
XX
KW 5.8S ribosomal RNA; 5.8S rRNA gene; internal transcribed spacer 1;
KW internal transcribed spacer 2; ITS1; ITS2.
XX
OS ascomycota sp. 4/97-9
OC Eukaryota; Fungi; Ascomycota.
XX
RN [1]
RP 1-581
RA Wirsel S.G.R.;
RT ;
RL Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases.
RL Wirsel S.G.R., Fakultaet fuer Biologie, Universitaet Konstanz,
RL Universitaetsstr. 10, Konstanz 78434, Germany.
XX
RN [2]
RA Wirsel S.G.R., Leibinger W., Mendgen K.W.;
RT "Genetic diversity of fungi associated with common reed (Phragmites
RT australis)";
RL Unpublished.
XX
FH Key Location/Qualifiers
FH
FT source 1..581
FT /db_xref="taxon:112223"
FT /organism="ascomycota sp. 4/97-9"
FT /isolate="4/97-9"
FT misc_feature 64..226
FT /note="internal transcribed spacer 1, ITS1"
FT rRNA 227..385
FT /gene="5.8S rRNA"
FT /product="5.8S ribosomal RNA"
FT misc_feature 386..529
FT /note="internal transcribed spacer 2, ITS2"
XX
SQ Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other;
ccatttagag gaagtaaaag tcgtaacaag gtctccgttg gtgaaccagggagggatc 60
ttacgagagt gtcaccactc ccaacccact gtttacctac ccgtccaccg tgcttcggca 120
ggcagtcctg tgggacaggg cctcgccccc ctccgggggg tgcctgccgc
EMBL entry
• Each line in the entry begins with a two-character line code, which indicates the type of information contained in the line.
• The currently used line types, along with their respective line codes, are listed below:
• ID - identification (begins each entry; 1 per entry)
• AC - accession number (>=1 per entry)
• SV - sequence version (1 per entry)
• DT - date (2 per entry)
• DE - description (>=1 per entry)
• KW - keyword (>=1 per entry)
• OS - organism species (>=1 per entry)
• OC - organism classification (>=1 per entry)
• OG - organelle (0 or 1 per entry)
• RN - reference number (>=1 per entry)
• RC - reference comment (>=0 per entry)
• RP - reference positions (>=1 per entry)
• RX - reference cross-reference (>=0 per entry)
• RA - reference author(s) (>=1 per entry)
• RT - reference title (>=1 per entry)
• RL - reference location (>=1 per entry)
• DR - database cross-reference (>=0 per entry)
• FH - feature table header (0 or 2 per entry)
• FT - feature table data (>=0 per entry)
• CC - comments or notes (>=0 per entry)
• XX - spacer line (many per entry)
• SQ - sequence header (1 per entry)
• bb - (blanks) sequence data (>=1 per entry)
• // - termination line (ends each entry; 1 per entry)
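Since every EMBL line announces its type in the first two characters, an entry can be split into fields by grouping lines on that code. The sketch below is an illustration under simplifying assumptions: the function name `group_embl_lines` and the `DEMO` entry are made up, spacer lines are dropped, and lines whose code is blank are filed under "bb" as in the list above.

```python
from collections import defaultdict


def group_embl_lines(entry_text):
    """Group the lines of one EMBL entry by their two-character line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue                   # skip completely empty lines
        if line.startswith("//"):
            break                      # termination line ends the entry
        code, content = line[:2], line[2:].strip()
        if code == "XX":
            continue                   # spacer lines carry no data
        if code.strip() == "":
            code = "bb"                # blank code marks sequence data
        grouped[code].append(content)
    return dict(grouped)


# Made-up miniature entry; only the line codes follow the EMBL layout.
DEMO = """ID   DEMO0001   standard; DNA; FUN; 20 BP.
XX
AC   DEMO0001;
XX
DE   Made-up entry for illustration
XX
SQ   Sequence 20 BP; 5 A; 5 C; 5 G; 5 T; 0 other;
     acgtacgtac gtacgtacgt        20
//
"""
fields = group_embl_lines(DEMO)
print(sorted(fields))
```

Multi-line fields such as DE, KW or FT simply end up as several strings under the same code, which a caller can join as needed.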
GCG Format
• Has space for comments and space for data, separated by two dots ..
• Can contain full sequence data like GenBank or EMBL
• Has a minimum of sequence name, length, date, type (nucleic or amino acid) and checksum
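The minimum header fields can be read from the line that ends in the two-dot divider. The sketch below uses the header line of the 5B3.seq example from this lecture as input. The function name `parse_gcg_header` is made up, and the regular expressions assume the "Length:", "Type:" and "Check:" labels appear exactly as in that example.

```python
import re


def parse_gcg_header(line):
    """Extract name, length, type and checksum from a GCG header line.

    Everything before the '..' divider is treated as the header.
    """
    header, divider, _ = line.partition("..")
    if not divider:
        raise ValueError("no '..' divider found")
    name = header.split()[0]
    length = int(re.search(r"Length:\s*(\d+)", header).group(1))
    seq_type = re.search(r"Type:\s*(\w)", header).group(1)
    check = int(re.search(r"Check:\s*(\d+)", header).group(1))
    return {"name": name, "length": length, "type": seq_type, "check": check}


info = parse_gcg_header(
    "5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586 .."
)
print(info)
```

The date is left unparsed here because its layout is the least regular part of the line.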
!!NA_SEQUENCE 1.0
5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586 ..
1 TCTAGAGGAG AYATYGTWAT GACCCAGTCT CCATCCTCCC TGAGTGTGTC
51 AGCAGGAGAG AAGGTCACTA TGAGCTGCAA GTCCAGTCAG AGTCTGTTAA
101 ACAGTAGAAA TCAAAAGAAC TACTTGGCCT GGTACCAGCA GAAACCAGGA
151 CAGCCTCCTA AACTTTTGAT CTACGGGGTA TTTATTAGGG ATTCTGGGGT
201 CCCTGATCGC TTCACAGGCA GTGGATCTGG AACCGATTTC ACTCTTACCA
251 TCAGCAGTGT GCAGGCTGAA GACCTGGCAG TTTATTACTG TCAGAATGAT
301 CATATTTATC CGTACACGTT CGGAGGGGGC ACWAAGCTGG AAATTAAAGG
351 GTCGACTTCC GGTAGCGGCA AATCCTCTGA AGGCAAAGGT SAGGTSCAGC
401 TGCAGGAGTC TGGACCTGGC CTGGTGAAGC CTTCCCAGTC TCTGTCCCTC
451 ACCTGCTCTG TCACTGGTTA CTCAATCACC AGTGGTTATG CCTGGAACTG
501 GATCCGGCAG TTTCCAGGAA ACAAACTGGA GTGGATGGGC TACATAAGCT
551 ACAGTGGTTT CACTAGCTAC AACCCATCTC TCAGAAGTCG AATCTCTTTC