anthony-liang
8/22/2019 02 Databases.ppt
4/03/13
Storing sequences: Databases and file formats
Reference
Zvelebil and Baum, Understanding Bioinformatics, chapter 3
Molecular biology databases
Sequence databases
  Annotated
  Low-annotation
  Specialized
Structural databases
Motif databases
Genome databases
Proteome databases
RNA expression
Literature
Populations
Mutations
Polymorphisms
Organisms
Pathways
Databases: terminology
Database: a collection of information related to a specific subject (e.g. a phone book)
Record: an entry in a database (e.g. your entry in the phone book)
Field: a component of a record (e.g. your address and number)
Flat-file databases
Flat-file databases store data as text files
Pros
  Easy to put together and distribute
  No need for expensive or complicated database management software
Cons
  Detailed, targeted searching is difficult
  Searching is not efficient
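The search limitation listed above can be sketched in a few lines of Python. This is a minimal illustration, not a real database layout: the tab-separated format and the choice of fields (accession, organism, sequence) are assumptions made for the example. The point is that a query on any field has to read and split every record in the file.

```python
import io

# Hypothetical flat file: one record per line, fields separated by tabs.
# The field layout (accession, organism, sequence) is an assumption.
FLAT_FILE = io.StringIO(
    "J00306\tHomo sapiens\tgaattcaagg\n"
    "X70508\tHomo sapiens\tgctgcatcag\n"
    "T12742\tZea mays\tCCTCAAGGGC\n"
)

def find_by_organism(handle, organism):
    """Linear scan: every record is read and split, even if only one matches."""
    hits = []
    for line in handle:
        accession, record_organism, _sequence = line.rstrip("\n").split("\t")
        if record_organism == organism:
            hits.append(accession)
    return hits

maize_hits = find_by_organism(FLAT_FILE, "Zea mays")
print(maize_hits)
```

With three records the scan is instant; with millions of records the same loop touches every line for every query, which is the inefficiency the slide refers to.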
Relational databases
Relational databases contain interconnected tables
Require a Relational Database Management System (RDBMS)
Queried using SQL (or more commonly, a GUI front-end)
  SELECT protab1.protein_name, protab2.protein_sequence
  FROM protab1, protab2
  WHERE protab1.protein_code = protab2.protein_code
  AND protab1.protein_code = 'P1002';
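The query can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table contents are invented, and the column names use underscores because hyphens are not valid in unquoted SQL identifiers. The code value is passed as a quoted string parameter.

```python
import sqlite3

# In-memory database with two linked tables; all rows are invented examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protab1 (protein_code TEXT PRIMARY KEY, protein_name TEXT);
CREATE TABLE protab2 (protein_code TEXT PRIMARY KEY, protein_sequence TEXT);
INSERT INTO protab1 VALUES ('P1001', 'example protein A');
INSERT INTO protab1 VALUES ('P1002', 'example protein B');
INSERT INTO protab2 VALUES ('P1001', 'MKT');
INSERT INTO protab2 VALUES ('P1002', 'MAL');
""")

# Join the two tables on the shared protein_code field.
query = """
SELECT protab1.protein_name, protab2.protein_sequence
FROM protab1
JOIN protab2 ON protab1.protein_code = protab2.protein_code
WHERE protab1.protein_code = ?;
"""
rows = conn.execute(query, ("P1002",)).fetchall()
print(rows)
```

The shared protein_code column is what makes the tables "interconnected": the name lives in one table, the sequence in another, and the join brings them together.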
Data in a database
Primary data
  e.g. DNA sequence, protein sequence, protein 3D structure coordinates
Annotations
  e.g. authors, literature references, protein function, organism of origin, location of coding regions in DNA sequence, etc.
Database record structure
A sequence database record contains both sequence and annotations
Record divided into 3 sections:
  Header
  Feature table
  Sequence
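The three-section layout can be illustrated by splitting a GenBank-style record on its FEATURES and ORIGIN keywords. This is a minimal sketch: the sample text is an abridged version of the HSPPI record shown in the file-format slides, and the helper name split_record is made up for the example.

```python
# Abridged GenBank-style record (most header and feature lines omitted).
RECORD = """\
LOCUS       HSPPI        450 bp    mRNA    linear   PRI 20-JUL-1993
DEFINITION  Homo sapiens mRNA for insulinoma pre-proinsulin.
ACCESSION   X70508
FEATURES             Location/Qualifiers
     source          1..450
     CDS             45..377
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
//
"""

def split_record(text):
    """Assign each line to header, features, or sequence.

    FEATURES starts the feature table, ORIGIN starts the sequence,
    and // marks the end of the record.
    """
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        sections[current].append(line)
    return sections

sections = split_record(RECORD)
for name, lines in sections.items():
    print(name, len(lines))
```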
GenBank Anatomy: Header
LOCUS       HUMSOMI      2667 bp    DNA     linear   PRI 13-JAN-1995
DEFINITION  Human somatostatin I gene and flanks.
ACCESSION   J00306
VERSION     J00306.1  GI:338287
KEYWORDS    neuropeptide Y; somatostatin; somatostatin I; somatostatin-14;
            somatostatin-28.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1126 to 1368; 2246 to 2605)
  AUTHORS   Shen,L.P., Pictet,R.L. and Rutter,W.J.
  TITLE     Human somatostatin I: sequence of the cDNA
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 79 (15), 4575-4579 (1982)
  PUBMED    6126875
REFERENCE   2  (bases 1 to 2667)
  AUTHORS   Shen,L.P. and Rutter,W.J.
  TITLE     Sequence of the human somatostatin I gene
  JOURNAL   Science 224 (4645), 168-171 (1984)
  PUBMED    6142531
COMMENT     Original source text: Human fetal liver DNA, Charon 4A library,
            clone pHSI-1-2.7 [2], and pancreatic somatostatinoma tissue,
GenBank Anatomy: Features
FEATURES             Location/Qualifiers
     source          1..2667
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /map="3q28"
     prim_transcript 1126..2605
                     /note="som I mRNA"
     CDS             join(1231..1368,2246..2458)
                     /note="preprosomatostatin I"
                     /codon_start=1
                     /protein_id="AAA60566.1"
                     /db_xref="GI:338288"
                     /translation="MLSCRLQCALAALSIVLALGCVTGAPSDPRLRQFLQKSLAAAAGKQELAKYFLAELLSEPNQTENDALEPEDLSQAAEQDEMRLELQRSANSNPAMAPRERKAGCKNFFWKTFTSC"
     sig_peptide     1231..1302
                     /note="prosomatostatin I signal peptide"
     mat_peptide     2372..2455
                     /product="somatostatin-28 peptide"
     mat_peptide     2414..2455
                     /product="somatostatin-14 peptide"
     gene            1231..1368
                     /gene="SST"
     exon
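Feature locations such as join(1231..1368,2246..2458) describe a coding region split across two exons. A small sketch of reading such a location and checking its length follows. The function names are made up for illustration, and only plain ranges and join() are handled; complement(), partial ends marked with < or >, and remote references are deliberately ignored.

```python
import re

def parse_location(location):
    """Return (start, end) pairs from '45..377' or 'join(a..b,c..d)'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def location_length(location):
    """Total number of bases covered; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in parse_location(location))

segments = parse_location("join(1231..1368,2246..2458)")
cds_length = location_length("join(1231..1368,2246..2458)")
residues = cds_length // 3 - 1   # codons minus the stop codon

print(segments)    # [(1231, 1368), (2246, 2458)]
print(cds_length)  # 351
print(residues)    # 116
```

The 116 residues agree with the length of the /translation string in the CDS feature, which is a quick consistency check between the annotation and the coordinates.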
GenBank Anatomy: Sequence
ORIGIN      Chromosome 3q28; 1 bp upstream of EcoRI site.
        1 gaattcaagg acaggttttc ttaaactttc tttgtttcta ggagatcagg cagagctgaa
       61 tttaaccaag aatcttttga tcctttccac atatagatat acaatagtgg tcacatatgt
      121 tctgggagtt cctagacctt atatgtctaa actggggctt cctgacataa aactatgctt
      181 accggcagga atctgttaga aaactcagag ctcagtagaa ggaacactgg ctttggaatg
      241 tggaggtctg gttttgctca aagtgtgcag tatgtgaagg agaacaattt actgaccatt
      301 actctgcctt actgattcaa attctgaggt ttattgaata atttcttaga ttgccttcca
      361 gctctaaatt tctcagcacc aaaatgaagt ccatttcaat ctctctctct ctctttccct
      421 cccgtacata tacacacact catacatata tatggtcaca atagaaaggc aggtagatca
      481 gaagtctcag ttgctgagaa agagggaggg agggtgagcc agagtacttc tcccccattg
      541 tagagaaaag tgaagttctt ttagagcccc gttacatctt caaggccttt tatgagataa
      601 tggaggaaat aaagagggct cagtccttct accgtccata tttcattctc aaatctgtta
      661 ttagaggaat gattctgatc tccacctacc atacacatgc cctgttgctt gttgggcctt
      721 acactaaaat gttagagtat gatgacagat ggagttgtct gggtacattt gtgtgcattt
      781 aagggtgata gtgtatttgc tctttaagag ctgagtgttt gagcctctgt ttgtgtgtaa
      841 ttgagtgtgc atgtgtggga gtgaaattgt ggaatgtgta tgctcatagc actgagtgaa
      901 aataaaagat tgtataaatc gtggggcatg tggaattgtg tgtgcctgtg cgtgtgcagt
      961 attttttttt ttttaagtaa gccactttag atcttgtcac ctcccctgtc ttctgtgatt
     1021 gattttgcga ggctaatggt gcgtaaaagg gctggtgaga tctgggggcg cctcctagcc
     1081 tgacgtcaga gagagagttt aaaacagagg gagacggttg agagcacaca agccgcttta
     1141 ggagcgaggt tcggagccat cgctgctgcc tgctgatccg cgcctagagt ttgaccagcc
     1201 actctccagc tcggctttcg cggcgccgag atgctgtcct gccgcctcca gtgcgcgctg
     1261 gctgcgctgt ccatcgtcct ggccctgggc tgtgtcaccg gcgctccctc ggaccccaga
     1321 ctccgtcagt ttctgcagaa gtccctggct gctgccgcgg ggaagcaggt aaggagactc
     1381 cctcgacgtc tcccggattc tccagccctc cctaagcctt gctcctgccc cattggtttg
     1441 gacgtaaggg atgctcagtc cttctaaaga gttttggtgc ttttctgggt ccctcagctc
Accessing sequence databases
Searching the header
  Searching the annotations for keywords (organism, gene name, etc.)
Searching the sequences
  Searching for sequences similar to a query sequence using programs such as BLAST
  Searching for sequences containing particular patterns
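Pattern searching, the last item above, can be sketched with a regular expression. The example scans the first 60 bases of the HUMSOMI record for the EcoRI recognition site (gaattc) and for the short motif agg. The function name find_pattern is made up for the example, and a lookahead is used so that overlapping matches are also reported. Similarity searching with BLAST is a different technique and is not shown here.

```python
import re

# First 60 bases of the HUMSOMI sequence shown in the GenBank record.
SEQUENCE = "gaattcaaggacaggttttcttaaactttctttgtttctaggagatcaggcagagctgaa"

def find_pattern(sequence, pattern):
    """Return 1-based start positions of every (possibly overlapping) match."""
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    return [match.start() + 1 for match in regex.finditer(sequence)]

ecori_sites = find_pattern(SEQUENCE, "gaattc")
agg_sites = find_pattern(SEQUENCE, "agg")
print(ecori_sites)  # [1]
print(agg_sites)    # [8, 13, 40, 48]
```

Because the pattern is a regular expression, ambiguity can be expressed with character classes, for example `[ag]` for a purine.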
Using the right words: ontologies
[Figure: ontology term hierarchy rooted at MOLECULAR FUNCTION, with the terms: nucleic acid binding; enzyme; DNA binding; helicase; adenosine triphosphatase; chromatin binding; DNA helicase; ATP-dependent helicase; DNA-dependent adenosine triphosphatase; ATP-dependent DNA helicase]
Finding databases for bioinformatics
Google is your friend (but be critical!)
Nucleic Acids Research annual database supplement
Nucleotide sequence repositories
Central repositories for all known public nucleotide sequences
Annotations and sequences are entered and curated by submitters
  Quality control issues
  Lack of consistency of annotations
Nucleotide sequence repositories
Main repositories:
  GenBank (US)
  EMBL (Europe)
  DDBJ (Japan)
All 3 databases exchange data daily and should contain the same sequences
Databases differ in their format and in the services they offer for searching and submission
GenBank
Currently maintained by the National Center for Biotechnology Information (part of the National Library of Medicine) in Bethesda, MD
Database available for download and searching using Entrez and BLAST
http://www.ncbi.nlm.nih.gov
EMBL
Currently maintained by the European Bioinformatics Institute in Hinxton, UK
Available for download and search using SRS, BLAST, FASTA
http://www.ebi.ac.uk
DDBJ (DNA Data Bank of Japan)
National Institute of Genetics, Japan
Available for download and search using SRS, BLAST, FASTA, etc.
http://www.ddbj.nig.ac.jp/
Nucleotide sequence data
Genomic DNA (whole or partial genomes)
cDNA and mRNA
ESTs
GenBank divisions
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTGS sequences (high throughput genomic sequences)
17. HTC - HTC sequences (high throughput cDNA sequences)
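The division code appears on the LOCUS line of every GenBank record, just before the date. A minimal sketch of pulling it out is shown below. Splitting on whitespace is a simplifying assumption that works for the LOCUS lines shown in these slides; the helper name parse_locus and the dictionary are made up for the example.

```python
# Division codes and descriptions as listed on the slide.
DIVISIONS = {
    "PRI": "primate", "ROD": "rodent", "MAM": "other mammalian",
    "VRT": "other vertebrate", "INV": "invertebrate",
    "PLN": "plant, fungal, and algal", "BCT": "bacterial", "VRL": "viral",
    "PHG": "bacteriophage", "SYN": "synthetic", "UNA": "unannotated",
    "EST": "expressed sequence tags", "PAT": "patent",
    "STS": "sequence tagged sites", "GSS": "genome survey sequences",
    "HTG": "high throughput genomic", "HTC": "high throughput cDNA",
}

def parse_locus(line):
    """Extract name, length, division and date from a LOCUS line.

    Assumes whitespace-separated tokens with the division code second
    to last and the date last, as in the records shown in the slides.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "division": tokens[-2],
        "date": tokens[-1],
    }

genomic = parse_locus("LOCUS HUMSOMI 2667 bp DNA linear PRI 13-JAN-1995")
est = parse_locus("LOCUS T12742 157 bp mRNA EST 28-OCT-1993")
print(genomic["division"], DIVISIONS[genomic["division"]])
print(est["division"], DIVISIONS[est["division"]])
```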
EST
Expressed Sequence Tags (ESTs) are short (usually about 300-500 bp), single-pass sequence reads from mRNA (cDNA). Typically they are produced in large batches. They represent a snapshot of genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library.
LOCUS       T12742        157 bp    mRNA            EST       28-OCT-1993
DEFINITION  zEST00149-5 Zea mays cDNA clone csuh00149/umc382 5' end similar to
            similar to short chain alcohol dehydrogenase.
ACCESSION   T12742
NID         g409680
KEYWORDS    EST.
SOURCE      Maize clone=csuh00149/umc382 library=Maize Leaf, Stratagene #937005
            strain=B73 vector=Uni-ZAP primer=SK Rsite1=EcoR1 Rsite2=Xho1 mRNA
            isolated from illuminated leaves and sheaths of 5 week old plant.
            cDNA directionally cloned into vector. .
  ORGANISM  Zea mays
            Eucaryotae; Embryophyta; Magnoliophyta; Liliopsida; Cyperales;
            Poaceae; Zea.
REFERENCE   1  (bases 1 to 157)
  AUTHORS   Baysdorfer,C.
  TITLE     The Maize cDNA Program
  JOURNAL   Unpublished (1993)
COMMENT
            Contact: Baysdorfer C
            California State University
            Dept Biol Sci, California State Univ, Hayward, CA 94542
            Tel: 5108813459
            Fax: 5107272035
            Email: [email protected].
FEATURES             Location/Qualifiers
     source          1..157
                     /organism="Zea mays"
                     /clone="csuh00149/umc382"
                     /strain="B73"
BASE COUNT       33 a     42 c     51 g     26 t      5 others
ORIGIN
        1 CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT
       51 ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG
      101 GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA
      151 GTTGAGG
EST entry
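The BASE COUNT line of the EST entry can be recomputed from the ORIGIN block, which is a handy sanity check when a record has been copied or converted. The sketch below strips the position numbers and spaces, then tallies the bases; the five N characters in this single-pass read are what the record reports as "others". The function names are made up for the example.

```python
from collections import Counter

# ORIGIN block of the T12742 EST entry, as shown above.
ORIGIN_BLOCK = """\
        1 CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT
       51 ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG
      101 GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA
      151 GTTGAGG
"""

def sequence_from_origin(text):
    """Keep only letters, dropping position numbers and whitespace."""
    return "".join(ch for ch in text if ch.isalpha()).lower()

def base_count(sequence):
    """Count a, c, g, t and lump everything else (e.g. n) into 'others'."""
    counts = Counter(sequence)
    result = {base: counts.get(base, 0) for base in "acgt"}
    result["others"] = len(sequence) - sum(result.values())
    return result

est_sequence = sequence_from_origin(ORIGIN_BLOCK)
est_counts = base_count(est_sequence)
print(len(est_sequence))  # 157
print(est_counts)         # {'a': 33, 'c': 42, 'g': 51, 't': 26, 'others': 5}
```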
Protein sequence databases
GenBank proteins/TrEMBL
SWISS-PROT
PIR
NRL3D
UniProt
GenBank protein, TrEMBL
Translations of CDS features in GenBank or EMBL
Limited annotations (annotations are usually the nucleotide annotations)
Most up-to-date
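How a protein entry is derived from a CDS feature can be sketched with a small translation routine. The example uses the first 120 bases of the HSPPI (pre-proinsulin) record from the file-format slides, whose CDS starts at base 45, and translates the part of the coding region that falls inside those 120 bases. The standard genetic code is assumed, and the helper names are made up for illustration.

```python
# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons; unknown codons (e.g. with n) become X."""
    dna = dna.lower()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

# First two ORIGIN lines (bases 1-120) of the HSPPI record.
FIRST_120 = (
    "gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc"
    "gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg"
).replace(" ", "")

cds_start = 45                               # from the feature "CDS 45..377"
partial_cds = FIRST_120[cds_start - 1:119]   # bases 45..119, 25 codons
protein_start = translate(partial_cds)
print(protein_start)  # MALWMRLLPLLALLALWGPDPAAAF
```

The result matches the beginning of the /translation qualifier in the HSPPI record, which is exactly the kind of automatic translation these protein entries contain.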
LOCUS       CYNPCP_1
DEFINITION  Cyanidium caldarium phycocyanin beta-subunit (cpcB) and cpcA genes,
            complete cds.
NID         g304585
DATE        21-Apr-1996
ACCESSION   L13467
ORGANISM    Cyanidium caldarium
COMMENT     Nucleic Acid Features translated to generate this entry:
            CDS 483..1001
            /gene="cpcB"
            /standard_name="phycocyanin"
            /codon_start=1
            /function="light harvesting"
            /evidence=experimental
            /product="phycocyanin beta subunit"
            /db_xref="PID:g304585"
AMINO       Complete
CARBOXY     Complete
CHECKSUM    137230
WEIGHT      18237.53
LENGTH      172 aa
ORIGIN
Composition
 27 Ala A    7 Gln Q   14 Leu L   13 Ser S
 11 Arg R    7 Glu E    5 Lys K    9 Thr T
  8 Asn N   12 Gly G    6 Met M    0 Trp W
 12 Asp D    0 His H    4 Phe F    5 Tyr Y
  3 Cys C   12 Ile I    4 Pro P   13 Val V
Mol. Wt. Unmod. Chain = 18237.53   Number Of Residues = 172
        1 MLDAFAKVVA QADARGEFLS NTQLDALSKM VSEGNKRLDV VNRITSNASA
       51 IVTNAARALF SEQPQLIQPG GIAYTNRRMA ACLRDMEIIL RYVSYAIIAG
      101 DSSVLDDRCL NGLRETYQAL GVPGASVAVG IEKMKDSAIA IANDPSGITT
      151 GDCSALMAEV GTYFDRAATA VQ
GenBank protein entry
SWISS-PROT
Maintained at the Swiss Institute of Bioinformatics and EBI
Manually curated: high-quality (but not perfect) annotations
Not as up to date
http://www.expasy.org
ID   PHCA_GALSU     STANDARD;      PRT;   162 AA.
AC   P00306;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)
DT   01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
DE   C-PHYCOCYANIN ALPHA CHAIN.
GN   CPCA.
OS   GALDIERIA SULPHURARIA (CYANIDIUM CALDARIUM).
OG   CHLOROPLAST.
OC   EUKARYOTA; PLANTA; PHYCOPHYTA; RHODOPHYTA (RED ALGAE).
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=IIID2;
RX   MEDLINE; 95232204.
RA   TROXLER R.F., YAN Y., JIANG J.W., LIU B.;
RL   PLANT PHYSIOL. 107:985-994(1995).
CC   -!- FUNCTION: LIGHT-HARVESTING PHOTOSYNTHETIC BILE PIGMENT-PROTEIN
CC       FROM THE PHYCOBILIPROTEIN COMPLEX.
CC   -!- SUBUNIT: HETERODIMER OF AN ALPHA AND A BETA CHAIN.
CC   -!- PTM: CONTAINS ONE COVALENTLY LINKED BILIN CHROMOPHORE.
DR   EMBL; L13467; G304586; -.
DR   EMBL; S77125; G998372; -.
DR   PIR; A00314; CFKKA.
DR   HSSP; P07122; 1CPC.
KW   PHYCOBILISOME; ELECTRON TRANSPORT; PHOTOSYNTHESIS; BILE PIGMENT;
KW   CHLOROPLAST.
FT   BINDING      84     84       PHYCOCYANOBILIN CHROMOPHORE.
FT   CONFLICT     61     61       S -> Q (IN REF. 3).
FT   CONFLICT     95     95       V -> I (IN REF. 3).
FT   CONFLICT    101    101       V -> A (IN REF. 3).
SQ   SEQUENCE   162 AA;  17505 MW;  A4BF84C3 CRC32;
Composition
 26 Ala A    7 Gln Q   13 Leu L   12 Ser S
  8 Arg R    9 Glu E    5 Lys K    9 Thr T
 11 Asn N   12 Gly G    4 Met M    1 Trp W
  5 Asp D    1 His H    3 Phe F   12 Tyr Y
  2 Cys C    9 Ile I    6 Pro P    7 Val V
Mol. Wt. Unmod. Chain = 17505.47   Number Of Residues = 162
        1 MKTPITEAIA AADNQGRFLS NTELQAVNGR YQRAAASLEA ARSLTSNAER
       51 LINGAAQAVY SKFPYTSQMP GPQYASSAVG KAKCARDIGY YLRMVTYCLV
      101 VGGTGPMDEY LIAGLEEINR TFDLSPSWYV EALNYIKANH GLSGQAANEA
      151 NTYIDYAINA LS
SWISS-PROT entry
TrEMBL and SWISS-PROT
[Figure: diagram relating EMBL (DNA), TrEMBL (SP-TrEMBL and REM-TrEMBL) and SWISS-PROT. Step labels: auto-translation; automatic sorting; annotation. Listed categories: short peptide fragments; immunoglobulins; T-cell receptors; patented sequences; synthetic peptides; non-protein DNA translations; others]
UniProt
Unified protein database incorporating PIR, SWISS-PROT, TrEMBL
http://www.uniprot.org
UniProt components
UniProt Knowledgebase (UniProtKB)
  central access point for extensive curated protein information, including function, classification, and cross-reference
UniProt Non-redundant Reference (UniRef)
  combines closely related sequences into a single record to speed searches
UniProt Archive (UniParc)
  comprehensive repository, reflecting the history of all protein sequences
Specialized sequence databases
Focus on a specific type of sequence
Sequences are often modified or specially annotated
Usage depends on the database
Examples:
  Ribosomal RNA databases
  Immunology databases
Non-redundant databases
Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Identical duplicate sequences are removed
Examples:
  NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
  NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
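The idea of removing identical duplicates while combining several sources can be sketched as follows. The entry identifiers and short sequences are invented for illustration, and the exact-match rule (case-insensitive string identity) is an assumption; this is not a description of how the NR sets are actually built.

```python
def merge_nonredundant(sources):
    """Merge records from several databases, keeping each sequence once.

    sources maps a database name to a list of (identifier, sequence) pairs.
    Returns a dict from sequence to the list of entries that carry it.
    """
    merged = {}
    for database, records in sources.items():
        for identifier, sequence in records:
            key = sequence.upper()
            merged.setdefault(key, []).append(f"{database}:{identifier}")
    return merged

# Hypothetical entries; identifiers and sequences are invented.
SOURCES = {
    "SWISS-PROT": [("sp_entry_1", "MKTPITEAIA"), ("sp_entry_2", "MALWMRLLPL")],
    "TrEMBL": [("tr_entry_1", "MLDAFAKVVA"), ("tr_entry_2", "malwmrllpl")],
}

nr = merge_nonredundant(SOURCES)
for sequence, entries in nr.items():
    print(sequence, entries)
```

Four input records collapse to three sequences, and the shared sequence remembers both of its source entries, so no cross-reference is lost.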
Data submission and quality
Primary repositories provide tools to submit sequences
Much quality control is left to the submitter
Some automatic quality control, but errors sometimes creep in
Human annotation takes time
RefSeq
NCBI Reference Sequence Collection
Non-redundant
Validated data
Format consistency
Ongoing curation, automated and manual
Distinct accession numbers: XX_NNNNNN
  e.g. NC_123456, XP_123456
Genomic DNA, transcript (RNA), and protein products, for major research organisms
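The two-letter prefix of a RefSeq accession indicates the kind of molecule. A minimal sketch of recognising the XX_NNNNNN pattern follows. The prefix table is a partial list covering common prefixes, not the full RefSeq scheme, and the function name is made up for the example.

```python
import re

# Two uppercase letters, an underscore, then digits (real accessions
# may have more than six digits, so the count is not fixed here).
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)$")

# Partial prefix table; other RefSeq prefixes exist.
MOLECULE_BY_PREFIX = {
    "NC": "genomic", "NG": "genomic",
    "NM": "transcript", "NR": "transcript",
    "XM": "transcript", "XR": "transcript",
    "NP": "protein", "XP": "protein",
}

def classify_refseq(accession):
    """Return (prefix, molecule type), or None if not RefSeq-style."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    prefix = match.group(1)
    return prefix, MOLECULE_BY_PREFIX.get(prefix, "unknown")

print(classify_refseq("NC_123456"))  # ('NC', 'genomic')
print(classify_refseq("XP_123456"))  # ('XP', 'protein')
print(classify_refseq("J00306"))     # None
```

The underscore is what distinguishes these from ordinary GenBank accessions such as J00306.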
Some important points
Always use the latest version of the database
Pay attention to accession and version numbers
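The VERSION line of a GenBank record (for example J00306.1) combines the stable accession with a version number that increases when the sequence changes. A small sketch of separating the two and comparing identifiers is shown below; the function names are made up, and J00306.2 is a hypothetical later version used only for the comparison.

```python
def split_accession_version(identifier):
    """Split 'J00306.1' into ('J00306', 1); version is None if absent."""
    accession, separator, version = identifier.partition(".")
    return accession, (int(version) if separator else None)

def same_sequence(identifier_a, identifier_b):
    """True only if both accession and version agree."""
    return split_accession_version(identifier_a) == split_accession_version(identifier_b)

print(split_accession_version("J00306.1"))   # ('J00306', 1)
print(split_accession_version("X70508"))     # ('X70508', None)
print(same_sequence("J00306.1", "J00306.2")) # False
```

Quoting the full accession.version in methods sections makes it unambiguous which sequence an analysis was run on.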
[Figure: GenBank growth]
Live demo
Searching for sequences at NCBI
Searching for sequences in UniProt
Some reading
NCBI Handbook chapter 1
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch1.pdf
Sequence file formats
Sequences can be stored on a computer in different formats/standards
Different software packages will require sequences to be stored in different formats
Programs such as readseq can be used to convert between formats
Sequence file formats: GenBank format
LOCUS       HSPPI         450 bp    mRNA    linear   PRI 20-JUL-1993
DEFINITION  Homo sapiens mRNA for insulinoma pre-proinsulin.
ACCESSION   X70508
VERSION     X70508.1  GI:394765
KEYWORDS    preproinsulin.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Chekhranova,M.K., Shuvalova,E.R., Kutin,A.M., Butnev,V.Iu.,
            Valentsova,A.B., Il'ina,E.N. and Pankov,Iu.A.
  TITLE     Cloning, primary structure determination and expression of
            preproinsulin cDNA from human insulinoma in Escherichia coli
  JOURNAL   Mol. Biol. (Mosk.) 26 (3), 596-600 (1992)
  MEDLINE   93024361
FEATURES             Location/Qualifiers
     source          1..450
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="pUEX1Ins12"
                     /clone_lib="Human insulinoma cDNA library"
     sig_peptide     45..80
     CDS             45..377
                     /codon_start=1
                     /product="pre-proinsulin"
                     /protein_id="CAA49913.1"
                     /db_xref="GI:394766"
                     /db_xref="SWISS-PROT:P01308"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
     mat_peptide     78..374
                     /product="pre-proinsulin"
BASE COUNT       86 a    152 c    136 g     76 t
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg
      121 tgaaccaaca cctgtgcggc tcacacctgg tggaagctct ctacctagtg tgcggggaac
      181 gaggcttctt ctacacaccc aagacccgcc gggaggcaga ggacctgcag gtggggcagg
      241 tggagctggg cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc
      301 tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc taccagctgg
      361 agaactactg caactagacg cagcccgcag gcagcccccc acccgccgcc tcctgcaccg
      421 agagagatgg aataaagccc ttgaaccagc
//
Sequence file formats: FASTA format
>HSPPI 450 bp mRNA linear PRI 20-JUL-1993, 450 bases, 44C checksum.
gctgcatcagaagaggccatcaagcacatcactgtccttctgccatggccctgtggatgc
gcctcctgcccctgctggcgctgctggccctctggggacctgacccagccgcagcctttg
tgaaccaacacctgtgcggctcacacctggtggaagctctctacctagtgtgcggggaac
gaggcttcttctacacacccaagacccgccgggaggcagaggacctgcaggtggggcagg
tggagctgggcgggggccctggtgcaggcagcctgcagcccttggccctggaggggtccc
tgcagaagcgtggcattgtggaacaatgctgtaccagcatctgctccctctaccagctgg
agaactactgcaactagacgcagcccgcaggcagccccccacccgccgcctcctgcaccg
agagagatggaataaagcccttgaaccagc
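Converting between the two formats shown so far mostly means discarding the annotation and re-wrapping the sequence. The sketch below is a minimal illustration of that idea, not a replacement for readseq: it keeps only the LOCUS name as the FASTA header, assumes one record per input, and uses an abridged record (first 120 bases only) as sample data. The function name is made up for the example.

```python
# Abridged GenBank-style record: LOCUS line and first two ORIGIN lines only.
GENBANK_TEXT = """\
LOCUS       HSPPI        450 bp    mRNA    linear   PRI 20-JUL-1993
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg
//
"""

def genbank_to_fasta(text, width=60):
    """Build a FASTA string from the LOCUS name and the ORIGIN block."""
    name = "unknown"
    parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop position numbers and spaces, keep the bases.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(parts)
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(rows) + "\n"

fasta_text = genbank_to_fasta(GENBANK_TEXT)
print(fasta_text, end="")
```

Everything except the name and the bases is lost in this direction, which is why conversion from FASTA back to GenBank cannot restore the annotations.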
Sequence file formats: GCG format
HSPPI
HSPPI 450 bp mRNA linear PRI 20-JUL-1993

HSPPI  Length: 450  Jul 23, 2003 13:38  Check: 1100  ..
       1  gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc
      51  ctgtggatgc gcctcctgcc cctgctggcg ctgctggccc tctggggacc
     101  tgacccagcc gcagcctttg tgaaccaaca cctgtgcggc tcacacctgg
     151  tggaagctct ctacctagtg tgcggggaac gaggcttctt ctacacaccc
     201  aagacccgcc gggaggcaga ggacctgcag gtggggcagg tggagctggg
     251  cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc
     301  tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc
     351  taccagctgg agaactactg caactagacg cagcccgcag gcagcccccc
     401  acccgccgcc tcctgcaccg agagagatgg aataaagccc ttgaaccagc