02 Databases.ppt

8/22/2019 02 Databases.ppt

1/8

4/03/13

1

Storing sequencesDatabases and file formats

Reference

Zvelebil and Baum, UnderstandingBioinformatics chapter 3

Molecular biology databases Sequence databases

Annotated Low-annotation Specialized

Structural databases Motif databases Genome databases

Proteome databases RNA expression Literature Populations

Mutations

Polymorphisms Organisms Pathways

Databases- terminology Database- collection of information related

to a specific subject (e.g. a phone book) Record- an entry in a database (e.g. your

entry in the phone book) Field- a component of a record (e.g. youraddress & number)

Flat-file databases store data as

text filesFlat-file databases

Pros Easy to put together

and distribute No need for expensive

or complicated

database management

software

Cons Detailed targeted

searching is difficult Searching is not

efficient


2/8

4/03/13

2

Relational databases contain

interconnected tablesRelational databases

Require aRelational Database ManagementSystem (RDBMS)

Queried using SQL (or more commonly, aGUI front-end)

SELECT protab1.protein-name, protab2.protein-sequenceFROM protab1, protab2WHERE protab1.protein-code = protab2.protein-codeAND protab1.protein-code = P1002;

Data in a database

Primary datae.g. DNA sequence, protein sequence, protein

3D structure coordinates Annotations

e.g. Authors, literature references, proteinfunction, organism of origin, location of coding

regions in DNA sequence etc

Database Record Structure A sequence database record contains both

sequence and annotations Record divided into 3 sections:

HeaderFeature tableSequence

GenBank Anatomy: HeaderLOCUS HUMSOMI 2667 bp DNA linear PRI 13-JAN-1995DEFINITION Human somatostatin I gene and flanks.ACCESSION J00306

VERSION J00306.1 GI:338287KEYWORDS neuropeptide Y; somatostatin; somatostatin I; somatostatin-14;

somatostatin-28.SOURCE Homo sapiens (human)ORGANISM Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;Catarrhini; Hominidae; Homo.

REFERENCE 1 (bases 1126 to 1368; 2246 to 2605)AUTHORS Shen,L.P., Pictet,R.L. and Rutter,W.J.TITLE Human somatostatin I: sequence of the cDNAJOURNAL Proc. Natl. Acad. Sci. U.S.A. 79 (15), 4575-4579 (1982)PUBMED 6126875

REFERENCE 2 (bases 1 to 2667)AUTHORS Shen,L.P. and Rutter,W.J.TITLE Sequence of the human somatostatin I geneJOURNAL Science 224 (4645), 168-171 (1984)PUBMED 6142531

COMMENT Original source text: Human fetal liver DNA, Charon 4A library,clone pHSI-1-2.7 [2], and pancreatic somatostatinoma tissue,

GenBank Anatomy: FeaturesFEATURES Location/Qualifiers

source 1..2667/organism="Homo sapiens"

/mol_type="genomic DNA"/db_xref="taxon:9606"/map="3q28"

prim_transcript 1126..2605/note="som I mRNA"

CDS join(1231..1368,2246..2458)/note="preprosomatostatin I"/codon_start=1/protein_id="AAA60566.1"/db_xref="GI:338288"/translation="MLSCRLQCALAALSIVLALGCVTGAPSDPRLRQFLQKSLAAAAGKQELAKYFLAELLSEPNQTENDALEPEDLSQAAEQDEMRLELQRSANSNPAMAPRERKAGCKNFFWKTFTSC"

sig_peptide 1231..1302/note="prosomatostatin I signal peptide"

mat_peptide 2372..2455/product="somatostatin-28 peptide"

mat_peptide 2414..2455/product="somatostatin-14 peptide"

gene 1231..1368/gene="SST"

exon


3/8

4/03/13

3

GenBank Anatomy: SequenceORIGIN Chromosome 3q28; 1 bp upstream of EcoRI site.

1 gaattcaagg acaggttttc ttaaactttc tttgtttcta ggagatcagg cagagctgaa61 tttaaccaag aatcttttga tcctttccac atatagatat acaatagtgg tcacatatgt121 tctgggagtt cctagacctt atatgtctaa actggggctt cctgacataa aactatgctt181 accggcagga atctgttaga aaactcagag ctcagtagaa ggaacactgg ctttggaatg241 tggaggtctg gttttgctca aagtgtgcag tatgtgaagg agaacaattt actgaccatt301 actctgcctt actgattcaa attctgaggt ttattgaata atttcttaga ttgccttcca361 gctctaaatt tctcagcacc aaaatgaagt ccatttcaat ctctctctct ctctttccct421 cccgtacata tacacacact catacatata tatggtcaca atagaaaggc aggtagatca481 gaagtctcag ttgctgagaa agagggaggg agggtgagcc agagtacttc tcccccattg541 tagagaaaag tgaagttctt ttagagcccc gttacatctt caaggccttt tatgagataa601 tggaggaaat aaagagggct cagtccttct accgtccata tttcattctc aaatctgtta661 ttagaggaat gattctgatc tccacctacc atacacatgc cctgttgctt gttgggcctt721 acactaaaat gttagagtat gatgacagat ggagttgtct gggtacattt gtgtgcattt781 aagggtgata gtgtatttgc tctttaagag ctgagtgttt gagcctctgt ttgtgtgtaa841 ttgagtgtgc atgtgtggga gtgaaattgt ggaatgtgta tgctcatagc actgagtgaa901 aataaaagat tgtataaatc gtggggcatg tggaattgtg tgtgcctgtg cgtgtgcagt961 attttttttt ttttaagtaa gccactttag atcttgtcac ctcccctgtc ttctgtgatt1021 gattttgcga ggctaatggt gcgtaaaagg gctggtgaga tctgggggcg cctcctagcc1081 tgacgtcaga gagagagttt aaaacagagg gagacggttg agagcacaca agccgcttta1141 ggagcgaggt tcggagccat cgctgctgcc tgctgatccg cgcctagagt ttgaccagcc1201 actctccagc tcggctttcg cggcgccgag atgctgtcct gccgcctcca gtgcgcgctg1261 gctgcgctgt ccatcgtcct ggccctgggc tgtgtcaccg gcgctccctc ggaccccaga1321 ctccgtcagt ttctgcagaa gtccctggct gctgccgcgg ggaagcaggt aaggagactc1381 cctcgacgtc tcccggattc tccagccctc cctaagcctt gctcctgccc cattggtttg1441 gacgtaaggg atgctcagtc cttctaaaga gttttggtgc ttttctgggt ccctcagctc

Accessing sequence databases Searching the header

Searching the annotations for keywords (organism,gene name etc)

Searching the sequences Searching for sequences similar to a query sequence

using programs such as BLAST

Searching for sequences containing particular patterns

Using the right words: ontologies

MOLECULAR FUNCTION

Nucleic acid binding enzyme

DNA binding helicase

Adenosine

triphophatase

Chromatin binding

DNA helicase ATP-dependant

helicase

DNA-dependant

Adenosine triphosphatase

ATP-dependant DNA helicase

Finding databases for bioinformatics

Google is your friend (but be critical!)Nucleic Acids Research annual database

supplement

Nucleotide sequence repositories

Central repositories for all known publicnucleotide sequences

Annotations and sequences are entered andcurated by submittersQuality control issuesLack of consistency of annotations


4/8

4/03/13

4

Nucleotide sequence repositories Main repositories:

GenBank (US) EMBL (Europe) DDBJ (Japan)

All 3 databases exchange data daily and shouldcontain the same sequences

Databases differ in their format and in the servicesthey offer for searching and submission

Genbank Currently maintained by the National

Center for Biotechnology Information (part

of the National Library of Medicine) in

Bethesda, MD Database available for download and

searching using Entrez and BLAST http://www.ncbi.nlm.nih.gov

EMBL Currently maintained by the European

Bioinformatics Institute in Hinxton, UK Available for download and search using

SRS, BLAST, fasta http://www.ebi.ac.uk

DDBJ (DNA database of Japan) National Institute of Genetics, Japan Available for download and search using

SRS, BLAST, fasta etc http://www.ddbj.nig.ac.jp/

Nucleotide sequence data

Genomic DNA (whole or partial genomes) cDNA and mRNA ESTs

Genbank divisions1. PRI - primate sequences2. ROD - rodent sequences3. MAM - other mammalian sequences4. VRT - other vertebrate sequences5. INV - invertebrate sequences6. PLN - plant, fungal, and algal sequences 7. BCT - bacterial sequences8. VRL - viral sequences 9. PHG - bacteriophage sequences10. SYN - synthetic sequences11. UNA - unannotated sequences12. EST - EST sequences (expressed sequence tags) 13. PAT - patent sequences14. STS - STS sequences (sequence tagged sites) 15. GSS - GSS sequences (genome survey sequences) 16.HTG - HTGS sequences (high throughput genomic sequences) 17. HTC - HTC sequences (high throughput cDNA sequences)


5/8

4/03/13

5

EST Expressed Sequence Tags (ESTs) are short

(usually about 300-500 bp), single-passsequence reads from mRNA (cDNA).Typically they are produced in largebatches. They represent a snapshot of genesexpressed in a given tissue and/or at a givendevelopmental stage. They are tags (somecoding, others not) of expression for a givencDNA library.

LOCUS T12742 157 bp mRNA EST 28-OCT-1993

DEFINITION zEST00149-5 Zea mays cDNA clone csuh00149/umc382 5' end similar tosimilar to short chain alcohol dehydrogenase.

ACCESSION T12742NID g409680KEYWORDS EST.SOURCE Maize clone=csuh00149/umc382 library=Maize Leaf, Stratagene #937005

strain=B73 vector=Uni-ZAP primer=SK Rsite1=EcoR1 Rsite2=Xho1 mRNA

isolated from illuminated leaves and sheaths of 5 week old plant.cDNA directionally cloned into vector. .

ORGANISM Zea maysEucaryotae; Embryophyta; Magnoliophyta; Liliopsida; Cyperales;Poaceae; Zea.

REFERENCE 1 (bases 1 to 157)AUTHORS Baysdorfer,C.TITLE The Maize cDNA ProgramJOURNAL Unpublished (1993)

COMMENT

Contact: Baysdorfer CCalifornia State UniversityDept Biol Sci, California State Univ, Hayward, CA 94542Tel: 5108813459Fax: 5107272035Email: [email protected].

FEATURES Location/Qualifierssource 1..157

/organism="Zea mays"/clone="csuh00149/umc382"/strain="B73"

BASE COUNT 33 a 42 c 51 g 26 t 5 othersORIGIN

1 CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT51 ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG101 GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA151 GTTGAGG

EST

entry

Protein sequence databases Genbank proteins/TrEMBL SWISSPROT PIR NRL3D UniProt

Genbank protein, TrEMBL Translations of CDS features in Genbank or

EMBL Limited annotations (annotations are usually

the nucleotide annotations) Most up-to-date

LOCUS CYNPCP_1DEFINITION Cyanidium caldarium phycocyanin beta-subunit (cpcB) and cpcAgenes,

complete cds.NID g304585DATE 21-Apr-1996

ACCESSION L13467ORGANISM Cyanidium caldariumCOMMENT Nucleic Acid Features translated to generate this entry:

CDS 483..1001/gene="cpcB"/standard_name="phycocyanin"/codon_start=1/function="light harvesting"/evidence=experimental/product="phycocyanin beta subunit"/db_xref="PID:g304585"

AMINO CompleteCARBOXY CompleteCHECKSUM 137230

WEIGHT 18237.53LENGTH 172 aaORIGIN

Composition27 Ala A 7 Gln Q 14 Leu L 13 Ser S11 Arg R 7 Glu E 5 Lys K 9 Thr T8 Asn N 12 Gly G 6 Met M 0 Trp W12 Asp D 0 His H 4 Phe F 5 Tyr Y3 Cys C 12 Ile I 4 Pro P 13 Val V

Mol. Wt. Unmod. Chain = 18237.53 Number Of Residues = 1721 MLDAFAKVVA QADARGEFLS NTQLDALSKM VSEGNKRLDV VNRITSNASA51 IVTNAARALF SEQPQLIQPG GIAYTNRRMA ACLRDMEIIL RYVSYAIIAG101 DSSVLDDRCL NGLRETYQAL GVPGASVAVG IEKMKDSAIA IANDPSGITT151 GDCSALMAEV GTYFDRAATA VQ

GenbankProtein

entry

SWISSPROT Maintained at the Swiss Bioinformatics

Institute and EBI Manually curated: high-quality (but not

perfect) annotations Not as up to date http://www.expasy.org


6/8

4/03/13

6

ID PHCA_GALSU STANDARD; PRT; 162 AA.AC P00306;DT 21-JUL-1986 (REL. 01, CREATED)DT 01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)DT 01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)DE C-PHYCOCYANIN ALPHA CHAIN.GN CPCA.OS GALDIERIA SULPHURARIA (CYANIDIUM CALDARIUM).OG CHLOROPLAST.OC EUKARYOTA; PLANTA; PHYCOPHYTA; RHODOPHYTA (RED ALGAE).RN [1]RP SEQUENCE FROM N.A.RC STRAIN=IIID2;RX MEDLINE; 95232204.RA TROXLER R.F., YAN Y., JIANG J.W., LIU B.;RL PLANT PHYSIOL. 107:985-994(1995).CC -!- FUNCTION: LIGHT-HARVESTING PHOTOSYNTHETIC BILE PIGMENT-PROTEINCC FROM THE PHYCOBILIPROTEIN COMPLEX.CC -!- SUBUNIT: HETERODIMER OF AN ALPHA AND A BETA CHAIN.CC -!- PTM: CONTAINS ONE COVALENTLY LINKED BILIN CHROMOPHORE.DR EMBL; L13467; G304586; -.DR EMBL; S77125; G998372; -.DR PIR; A00314; CFKKA.DR HSSP; P07122; 1CPC.KW PHYCOBILISOME; ELECTRON TRANSPORT; PHOTOSYNTHESIS; BILE PIGMENT;KW CHLOROPLAST.FT BINDING 84 84 PHYCOCYANOBILIN CHROMOPHORE.FT CONFLICT 61 61 S -> Q (IN REF. 3).FT CONFLICT 95 95 V -> I (IN REF. 3).FT CONFLICT 101 101 V -> A (IN REF. 3).SQ SEQUENCE 162 AA; 17505 MW; A4BF84C3 CRC32;

Composition26 Ala A 7 Gln Q 13 Leu L 12 Ser S8 Arg R 9 Glu E 5 Lys K 9 Thr T11 Asn N 12 Gly G 4 Met M 1 Trp W5 Asp D 1 His H 3 Phe F 12 Tyr Y2 Cys C 9 Ile I 6 Pro P 7 Val V

Mol. Wt. Unmod. Chain = 17505.47 Number Of Residues = 1621 MKTPITEAIA AADNQGRFLS NTELQAVNGR YQRAAASLEA ARSLTSNAER51 LINGAAQAVY SKFPYTSQMP GPQYASSAVG KAKCARDIGY YLRMVTYCLV101 VGGTGPMDEY LIAGLEEINR TFDLSPSWYV EALNYIKANH GLSGQAANEA151 NTYIDYAINA LS

SWISS-

PROT

entry

TrEMBL and SWISS-PROT

TrEMBL SP-TrEMBL

REM-TrEMBL

SWISS-PROT

EMBL(DNA)

Auto-translation

Short peptide fragmentsImmunoglobulinsT-Cell receptors

Patented sequencesSynthetic peptides

Non-protein DNA translations

others

Annotation

Automatic

sorting

UniProt Unified protein database incorporating PIR,

SWISS-PROT, TrEMBL http://www.uniprot.org

UniProt components UniProt Knowledgebase (UniProtKB)

central access point for extensive curated proteininformation, including function, classification, andcross-reference

UniProt Non-redundant Reference (UniRef) combines closely related sequences into a single record

to speed searches UniProt Archive (UniParc)

comprehensive repository, reflecting the history of allprotein sequences.

Specialized sequence databases Focus on a specific type of sequences Sequences are often modified or specially

annotated Usage depends on the database Examples:

Ribosomal RNA databases Immunology databases

Non-redundant databases Sequence data only: cannot be browsed, can only

be searched using a sequence Combine sequences from more than one database Identical duplicate sequences are removed Examples:

NR Nucleic (genbank+EMBL+DDBJ+PDB DNA) NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB

protein)


7/8

4/03/13

7

Data submission and quality

Primary repositories provide tools to submitsequences

Much quality control is left to the submitter Some automatic quality control, but errors

sometimes creep in Human annotation takes time

RefSeq NCBI Reference Sequence Collection Non-redundant Validated data Format consistency Ongoing curation, automated and manual Distinct accession numbers:XX_NNNNNN

eg: NC_123456, XP_123456 Genomic DNA, transcript (RNA), and protein

products, for major research organisms

Some important points

Always use the latest version of thedatabase

Pay attention to accession and versionnumbers

Genbank growth

Live demo

Searching for sequences at NCBI Searching for sequences in Uniprot

Some reading

NCBI Handbook chapter 1http://www.ncbi.nlm.nih.gov/books/

bookres.fcgi/handbook/ch1.pdf


8/8

4/03/13

8

Sequence file formats Sequences can be stored on a computer in

different formats/standards Different software packages will require

sequences to be stored in different formats Programs such as readseq can be used to

convert between formats

Sequence file formats:

Genbank formatLOCUS HSPPI 450 bp mRNA linear PRI 20-JUL-1993DEFINITION Homo sapiens mRNA for insulinoma pre-proinsulin.

ACCESSION X70508VERSION X70508.1 GI:394765KEYWORDS preproinsulin.SOURCE human.ORGANISM Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

REFERENCE 1 (ba ses 1 to 450)AUTHORS Chekhranova,M.K., Shuvalova,E.R., Kutin,A.M., Butnev,V.Iu.,

Valentsova,A.B., Il'ina,E.N. and Pankov,Iu.A.TITLE Cloning, primary structure determination and expression of

preproinsulin cDNA from human insulinoma in Escherichia coliJOURNAL Mol. Biol. (Mosk.) 26 (3), 596-600 (1992)

MEDLINE 93024361FEATURES Location/Qualifiers

source 1..450/organism="Homo sapiens"/db_xref="taxon:9606"/clone="pUEX1Ins12"/clone_lib="Human insulinoma cDNA library"

sig_peptide 45..80CDS 45..377

/codon_start=1/product="pre-proinsulin"/protein_id="CAA49913.1"/db_xref="GI:394766"/db_xref="SWISS-PROT:P01308"/translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

mat_peptide 78..374/product="pre-proinsulin"

BASE COUNT 86 a 152 c 136 g 76 tORIGIN

1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg121 tgaaccaaca cctgtgcggc tcacacctgg tggaagctct ctacctagtg tgcggggaac181 gaggcttctt ctacacaccc aagacccgcc gggaggcaga ggacctgcag gtggggcagg241 tggagctggg cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc301 tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc taccagctgg361 agaactactg caactagacg cagcccgcag gcagcccccc acccgccgcc tcctgcaccg421 agagagatgg aataaagccc ttgaaccagc

//

Sequence file formats: Fasta format>HSPPI 450 bp mRNA linear PRI 20-JUL-1993, 450 bases, 44C checksum.gctgcatcagaagaggccatcaagcacatcactgtccttctgccatggccctgtggatgcgcctcctgcccctgctggcgctgctggccctctggggacctgacccagccgcagcctttgtgaaccaacacctgtgcggctcacacctggtggaagctctctacctagtgtgcggggaacgaggcttcttctacacacccaagacccgccgggaggcagaggacctgcaggtggggcaggtggagctgggcgggggccctggtgcaggcagcctgcagcccttggccctggaggggtccctgcagaagcgtggcattgtggaacaatgctgtaccagcatctgctccctctaccagctggagaactactgcaactagacgcagcccgcaggcagccccccacccgccgcctcctgcaccgagagagatggaataaagcccttgaaccagc

Sequence file formats: GCG

formatHSPPIHSPPI 450 bp mRNA linear PRI 20-JUL-1993

HSPPI Length: 450 Jul 23, 2003 13:38 Check: 1100 ..1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc51 ctgtggatgc gcctcctgcc cctgctggcg ctgctggccc tctggggacc101 tgacccagcc gcagcctttg tgaaccaaca cctgtgcggc tcacacctgg151 tggaagctct ctacctagtg tgcggggaac gaggcttctt ctacacaccc

201 aagacccgcc gggaggcaga ggacctgcag gtggggcagg tggagctggg251 cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc301 tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc351 taccagctgg agaactactg caactagacg cagcccgcag gcagcccccc401 acccgccgcc tcctgcaccg agagagatgg aataaagccc ttgaaccagc

Documents

02 Databases.ppt