Biological Databases

Biological Databases - Computational Bioscience Program at ...compbio.ucdenver.edu/77112015/Dowell database-15.pdf


What will we discuss today?
• Why are databases the backbone of bioinformatics?
• The basic structure of a database
• Data versus annotation
• Types of DBs: GenBank, PubMed and NCBI
• Query strategies
• Quality of data issues

http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/

Biologists Collect Lots of Data
• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names (thousands)
  – phenotype of mutants (infinite?)
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)

• High-throughput technology
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips
• Must have data quality assessment prior to analysis

One sequencer => 1-2 Tb/week!

Curated Biological Data: DNA, nucleotide sequences
• Gene boundaries, topology
• Gene structure
• Introns, exons, ORFs, splicing
• Expression data
• Mass spectrometry

Curated Biological Data: Proteins, residue sequences
• Mass spectrometry (metabolomics, proteomics)
• Post-translational protein modification (PTM)
• Residue sequence, e.g. MCTUYTCUYFSTYRCCTYFSCD
• Extended sequence information
• Secondary structure
• Hydrophobicity, motif data
• Protein-protein interaction

Curated Biological Data: 3D structures, folds

WHAT is a database?
• A collection of data that needs to be:
  – Structured
  – Searchable
  – Updated (periodically)
  – Cross-referenced
• Challenge:
  – To change "meaningless" data into useful information that can be accessed and analysed in the best way possible.

For example: HOW would YOU organize all biological sequences so that the biological information is optimally accessible?

http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics

A Spreadsheet can be a Database
• Columns are Fields
• Rows are Records
• Can search for a term within just one field
• Or combine searches across several fields

SNP ID | SNPSeq ID | Gene | +primer | -primer | Hap A | Hap B | Hap C
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | - | A
M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | - | A
M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | -
M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T
M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | -
D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T
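The field-and-record idea can be sketched in a few lines of Python. This is only an illustration: the rows reuse values from the SNP table, while the lowercase field names and the `search` helper are invented for the example, and `None` stands in for an empty haplotype cell.

```python
# Each dict is one record (row); each key is one field (column).
records = [
    {"snp_id": "D1Mit160_1", "gene": "lymphocyte antigen 84",
     "hap_a": "C", "hap_b": None, "hap_c": "A"},
    {"snp_id": "M-05554_2", "gene": "complement component factor i",
     "hap_a": "A", "hap_b": "C", "hap_c": None},
    {"snp_id": "M-09947_3", "gene": "caspase 8",
     "hap_a": "G", "hap_b": "C", "hap_c": "T"},
    {"snp_id": "D1Mit284_3", "gene": "nucleolin",
     "hap_a": "G", "hap_b": "T", "hap_c": "T"},
]


def search(rows, **criteria):
    """Return the rows whose fields equal every given value (an AND search)."""
    return [r for r in rows if all(r.get(f) == v for f, v in criteria.items())]


# Search within just one field ...
one_field = search(records, hap_a="G")
# ... or combine searches across several fields.
two_fields = search(records, hap_a="G", hap_b="C")

print([r["snp_id"] for r in one_field])
print([r["snp_id"] for r in two_fields])
```

Adding a second criterion can only narrow the result, which is the behaviour expected of a combined (AND) search.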

DBMS
• Internal organization
  – Controls speed and flexibility
• A set of programs that
  – Store
  – Extract
  – Modify

Diagram: USER(S) interact with the Database through Store, Extract and Modify.

DBMS organisation types
• Flat file databases (flat DBMS)
  – Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  – Simple, restrictive, tables
• Relational databases (RDBMS)
  – Complex, versatile, tables
• Object-oriented databases (ODBMS)
  – Complex, versatile, objects
• Data warehouses and distributed databases

Diagram labels: Information system, Query system, Storage system, Data

Why are flat files still used?

Structured Data
• Repository of information
• Managed and accessed differently
• Flat-file (text)
• Relational (key)
• "talk" to each other

Relational databases
• Data is stored in multiple related tables
• Data relationships across tables can be either many-to-one or many-to-many
• A few rules allow the database to be viewed in many ways

Relational Databases
• What have we achieved?
  – No repeating information
  – Less storage space
  – Better reality representation
  – Easy modification/management
  – Easy usage of any combination of records
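A minimal sketch of "no repeating information", using Python's built-in `sqlite3` module with an in-memory database. The table layout and the SNP identifiers (`SNP_A`, `SNP_B`) are hypothetical and chosen only for illustration; the point is that the gene name is stored once and reached through a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE snp (
        snp_id  TEXT PRIMARY KEY,
        gene_id INTEGER NOT NULL REFERENCES gene(gene_id)
    );
""")

# Many SNPs point at one gene (a many-to-one relationship).
conn.execute("INSERT INTO gene VALUES (1, 'caspase 8')")
conn.executemany("INSERT INTO snp VALUES (?, ?)", [("SNP_A", 1), ("SNP_B", 1)])

# Easy modification: the name is changed in exactly one place.
conn.execute("UPDATE gene SET name = 'CASP8' WHERE gene_id = 1")

rows = conn.execute(
    "SELECT snp.snp_id, gene.name "
    "FROM snp JOIN gene USING (gene_id) "
    "ORDER BY snp.snp_id"
).fetchall()
print(rows)
```

In a single flat table the gene name would be repeated on every SNP row and would have to be corrected row by row.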

Three reasons to care …
• Database proliferation
  – Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  – E.g. translational medicine: "from bench to bedside" (genomic data vs. clinical data)

Standard Data Formats
• DNA sequence = ACGT, but what about gaps, unknown letters, etc.?
  – How many letters per line?
  – Spaces, numbers, headers, etc.?
  – Store as a string, code as binary numbers, etc.?
• Use a completely different format for proteins?

Need standard formats!

FASTA Format
• William Pearson (1985)
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

One header line: starts with > and ends with a [return]. All other characters are part of the sequence.
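The rule just stated (one ">" header line, everything else is sequence) is enough to write a small reader. The sketch below is a minimal illustration in plain Python, not a replacement for an established parser; the function name `parse_fasta` is invented here, and the example text is a shortened copy of the URO1 record.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # every other line is part of the sequence
    if header is not None:
        yield header, "".join(chunks)


example = """>URO1 uro1.seq Length: 2018
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
"""

fasta_records = list(parse_fasta(example))
for header, sequence in fasta_records:
    print(header.split()[0], len(sequence))
```

Because a new ">" line closes the previous record, the same function also reads files that contain many sequences.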

Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…
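The header lines in this file carry their annotation as `key=value;` pairs after the identifier. A hedged sketch of pulling those pairs apart follows; `parse_header` is a name made up for this example, the header is an abbreviated copy of the FBpp0082232 line (parent, dbxref and MD5 omitted), and the layout is assumed from the records shown rather than from any format specification.

```python
def parse_header(header):
    """Split '>ID key=value; key=value; ...' into (id, {key: value})."""
    seq_id, _, rest = header.lstrip(">").partition(" ")
    fields = {}
    for part in rest.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty fragments such as the one after the last ';'
            fields[key] = value
    return seq_id, fields


header = (
    ">FBpp0082232 type=protein; "
    "loc=3R:complement(9207109..9207225,9207285..9207431); "
    "ID=FBpp0082232; name=mRpS21-PA; release=r5.1; species=Dmel; length=87;"
)

seq_id, fields = parse_header(header)
print(seq_id, fields["name"], fields["length"])
```

Splitting on ";" first and on the first "=" second keeps values that themselves contain commas, colons or parentheses (such as `loc`) intact.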

Reformatting Data Files
• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software
• Then messing around with the results produced by that software to create a useful summary…
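A typical small reformatting job, sketched in Python under simple assumptions: a sequence arrives as space-separated blocks (as in the FlyBase listing) and a downstream program wants FASTA with 60 characters per line. The helper names are hypothetical, and the line width is an arbitrary choice for the example.

```python
def wrap_sequence(seq, width=60):
    """Cut a sequence string into lines of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def to_fasta(header, seq, width=60):
    """Format one record as FASTA text with a fixed line width."""
    return "\n".join([">" + header] + wrap_sequence(seq, width)) + "\n"


# Residues of FBpp0082232 as they appear in the slide, split into blocks.
raw = ("MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV "
       "RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS")

seq = "".join(raw.split())          # drop all whitespace
fasta = to_fasta("FBpp0082232", seq)
print(fasta, end="")
```

Stripping whitespace before re-wrapping means the output does not depend on how the input happened to be broken into lines.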

Accession Numbers!! (keys)
• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure

Neither scientists nor computers can be trusted to accurately and consistently annotate database entries!
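The difference between a key lookup and a keyword search can be illustrated with a small dictionary. The accession numbers and descriptions below are taken from the SNP table earlier in these slides; the two helper functions are invented for the example.

```python
# Records keyed by accession number; descriptions are free-text annotation.
by_accession = {
    "X60184": "complement component factor i",
    "AF067835": "caspase 8",
    "U02023": "insulin-like growth factor binding protein",
    "J05234": "nucleolin",
}


def lookup(accession):
    """Exact lookup by key: one record or nothing."""
    return by_accession.get(accession)


def keyword_search(word):
    """Substring search in the free-text description."""
    word = word.lower()
    return sorted(acc for acc, desc in by_accession.items()
                  if word in desc.lower())


print(lookup("AF067835"))        # exactly one record
print(keyword_search("factor"))  # ambiguous: more than one hit
print(keyword_search("CASP8"))   # a synonym the annotator never typed: no hit
```

The keyword search fails in both directions: a common word matches unrelated entries, and an alternative gene symbol matches nothing at all.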

Accessing database information
• A request for data from a database is called a query
• Queries can be of three forms:
  – Choose from a list of parameters
  – Query by example (QBE)
  – Structured Query Language (SQL)

Web Query
• Most databases have a web-based query tool
• It may be simple…

… or complex

Query Languages
• The standard
  – SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  – Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  – Standard interactive and programming language for getting information from and updating a database.
  – RDBMS (SQL), ODBMS (Java, C++, OQL, etc.)
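Both uses named above, getting information out and updating it, can be shown with two SQL statements run through Python's `sqlite3` module. This is a sketch under assumptions: the `sequence` table is invented, the `D25291` row borrows its values from the GenBank record shown later in the slides, and `HYPO0001` is a made-up accession.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("D25291", "Murinae gen. sp.", 1803),
        ("HYPO0001", "Homo sapiens", 500),  # hypothetical record
    ],
)

# Getting information: which sequences are longer than 1000 bases?
long_seqs = conn.execute(
    "SELECT accession FROM sequence WHERE length > ? ORDER BY accession",
    (1000,),
).fetchall()

# Updating: correct the length of one record, addressed by its key.
conn.execute(
    "UPDATE sequence SET length = ? WHERE accession = ?", (520, "HYPO0001")
)
updated = conn.execute(
    "SELECT length FROM sequence WHERE accession = ?", ("HYPO0001",)
).fetchone()[0]

print(long_seqs, updated)
```

The `?` placeholders pass values separately from the SQL text, which avoids quoting problems when a value contains punctuation.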

Database Searching
A database can only be searched in ways that it was designed to be searched

Boolean: "AND" and "OR" searches

Bad to search for "human hemoglobin" in a 'Description' field

Much better to search for "Homo sapiens" in 'Organism' AND "HBB" in 'Gene name'
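As a sketch of the better search, the snippet below builds a fielded Boolean query string and a request URL for it. It assumes NCBI Entrez conventions (square-bracket field tags such as `[Organism]` and `[Gene Name]`, and the E-utilities `esearch.fcgi` endpoint); the helper `fielded_query` is invented here, and no request is actually sent.

```python
from urllib.parse import urlencode


def fielded_query(terms):
    """Join (value, field) pairs into 'value[field] AND value[field] ...'."""
    parts = []
    for value, field in terms:
        quoted = f'"{value}"' if " " in value else value  # quote phrases
        parts.append(f"{quoted}[{field}]")
    return " AND ".join(parts)


query = fielded_query([("Homo sapiens", "Organism"), ("HBB", "Gene Name")])

# Assumed endpoint and parameter names; check the current NCBI documentation.
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
    {"db": "nucleotide", "term": query}
)

print(query)
print(url)
```

Restricting each term to its own field is what separates this from typing "human hemoglobin" into a free-text box.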

Strategies
• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases
• Be persistent!

Data versus metadata (annotation)

Primary vs derived data

Heterogeneity in data (scientific data domains)

Gene Ontology
• Biology is a messy science
• Assortment of names, mutants, odd phenotypes
  – "sonic hedgehog"
• Gene Ontology (GO)
  – Molecular function (specific tasks)
  – Biological process (broad biological goal)
  – Cellular component (location)
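The three branches can be pictured as a controlled vocabulary that every annotation must fall into. The sketch below is illustrative only: the three (gene, aspect, term) rows for HBB are written by hand for this example rather than pulled from an annotation file, and the function name is invented.

```python
from collections import defaultdict

# The three top-level branches named on the slide.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")

# (gene, aspect, term) triples; hand-written illustrative annotations.
annotations = [
    ("HBB", "molecular_function", "oxygen binding"),
    ("HBB", "biological_process", "oxygen transport"),
    ("HBB", "cellular_component", "hemoglobin complex"),
]


def group_by_aspect(rows):
    """Group annotations by branch, rejecting anything outside the vocabulary."""
    grouped = defaultdict(list)
    for gene, aspect, term in rows:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        grouped[aspect].append((gene, term))
    return dict(grouped)


grouped = group_by_aspect(annotations)
for aspect in ASPECTS:
    print(aspect, grouped[aspect])
```

Rejecting unknown branches is the small-scale version of what a controlled vocabulary does for a messy naming tradition.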

 

GIGO: Data Quality Matters

Some biological databases …

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.

Some statistics
• More than 1000 different databases
• Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
• Variable size: <100 Kb to >10 Gb
  – DNA: >10 Gb
  – Protein: 1 Gb
  – 3D structure: 5 Gb
  – Other: smaller
• Update frequency: daily to annually

NAR Database Issue
• Online collection of biological databases: http://www.oxfordjournals.org/nar/database/c/

Public Sequence Databases
Same sequence information in all three, but different tools for searching and retrieval
• GenBank (NCBI, NIH): searched with Entrez
• EMBL (EBI): searched with SRS
• DDBJ (NIG, CIB): searched with getentry
• Each of the three accepts submissions and updates

GenBank
• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
• Flatfile: composed entirely of text
• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence
  – Locus and Accession #

Growth of GenBank

GenBank Flat File (GBFF)

Record layout: Header (Title, Taxonomy, Citation), then Features (including the AA sequence), then the DNA Sequence.

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//
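Because a flat file is just text with keywords at the start of lines, a few fields can be pulled out with ordinary string handling. The sketch below reads only LOCUS, ACCESSION and the sequence under ORIGIN; the function name is invented, the example is a heavily trimmed copy of the MUSNGH record, and a real project would more likely rely on an established parser.

```python
def parse_genbank(text):
    """Extract locus name, accession and sequence from one flat-file record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):       # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of ten bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
    return record


example = """LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
ACCESSION   D25291
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
//
"""

gb = parse_genbank(example)
print(gb["locus"], gb["accession"], len(gb["sequence"]))
```

Keeping only alphabetic characters on the ORIGIN lines discards both the position numbers and the spacing in one step.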

Fields

http://www.ncbi.nlm.nih.gov/Genbank
• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.
• At least doubles in size every 18 months
• There are approximately 130,671,233,801 bases, from 142,284,608 reported sequences in the traditional GenBank divisions as of August 2011.
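The doubling rule can be turned into a quick back-of-the-envelope projection: starting from N0 bases, the size after t months is N0 * 2^(t/18). The sketch below applies that formula to the August 2011 figure; it is an extrapolation of the slide's rule of thumb, not a measurement, and the function name is invented.

```python
import math


def projected_bases(start_bases, months, doubling_months=18):
    """Size after `months`, assuming one doubling every `doubling_months`."""
    return start_bases * 2 ** (months / doubling_months)


start = 130_671_233_801  # bases in the traditional divisions, August 2011

after_18 = projected_bases(start, 18)   # one doubling
after_36 = projected_bases(start, 36)   # two doublings
after_12 = projected_bases(start, 12)   # less than one doubling

print(f"{after_18:.3e} {after_36:.3e} {after_12:.3e}")
```

Since the slide says "at least doubles", these values are best read as lower bounds under that assumption.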

Distribution of sequence databases
• Books, articles: 1968 -> 1985
• Computer tapes: 1982 -> 1992
• Floppy disks: 1984 -> 1990
• CD-ROM: 1989 -> ?
• FTP: 1989 -> ?
• On-line services: 1982 -> 1994
• WWW: 1993 -> ?
• DVD: 2001 -> ?
• Mailing hard drives: 2009 -> ?

• Many sequences in GenBank correspond to the same gene
• Genomic clones, full-length mRNA, various kinds of ESTs, submitted by different investigators
• RefSeq is the "Reference Sequence" for a gene, as determined by GenBank curators
  – best guess given the current evidence, can change
  – usually based on the longest mRNA
  – usually has both 5' and 3' UTR
• Not necessarily reliable
  – A lot is not yet known… e.g. alternative splicing

Last thoughts on GenBank ...
• Often only use FASTA files (e.g. for BLAST)
• GBFF are simply human-readable versions of these records
• GBFF have become a vehicle for a lot more information than they were meant to carry
• Keep in mind that GenBank is DNA-centric and is a poor vehicle for protein and mRNA expression/interaction information

Many Datasets at NCBI
• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal Articles (PubMed)
  – Genetic Diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Gene Expression (GEO)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Taxonomy
  – Chemistry (PubChem)

Ensembl at EBI/EMBL

http://genome.cshlp.org/content/14/5/971.full

KEGG: Kyoto Encyclopedia of Genes and Genomes
• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways

http://www.wwpdb.org

Golden Rules
• Use published databases and methods
  – Supported, maintained, trusted by community
• Document what you have done!
  – Sequence identification numbers
  – Server, database, program VERSION
  – Program parameters
• Assess reliability of results
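One lightweight way to follow the documentation rule is to save a small machine-readable record next to every result. The sketch below is only a suggestion: the field names mirror the bullet points above, while the function name and every value (server, program, version, parameters) are placeholders invented for the example.

```python
import json


def provenance_record(sequence_ids, server, database, program, version, parameters):
    """Collect what is needed to repeat an analysis later."""
    return {
        "sequence_ids": list(sequence_ids),
        "server": server,
        "database": database,
        "program": program,
        "version": version,
        "parameters": dict(parameters),
    }


record = provenance_record(
    sequence_ids=["D25291"],
    server="example.org",            # placeholder host name
    database="nucleotide",
    program="blastn",
    version="VERSION-PLACEHOLDER",   # record the real version string here
    parameters={"evalue": 1e-5},     # placeholder parameter
)

text = json.dumps(record, indent=2, sort_keys=True)
print(text)
```

Writing the record as JSON keeps it readable for people and trivially loadable by the next script in the pipeline.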

Bio-databases: A short word on problems
• Even today we face some key limitations
  – There is no single standard format
    • Every database or program has its own format
  – There is no standard nomenclature
    • Every database has its own names
  – Data is not fully optimized
    • Some datasets have missing information without indications of it
  – Data errors
    • Data is sometimes of poor quality, erroneous, misspelled
    • Error propagation resulting from computer annotation

What to take home
• A database is a collection of data
  – Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Computers can only create data, they do not give answers
• Learn to use the big reliable databases (e.g. NCBI)

• Open access to sequences is essential for all of the work we do: without it there would be no bioinformatics, no BLAST, no Computational Bioscience Program
• Sequence information is not all that needs to be open. We also need open access to the literature.

RECOMMENDED EXERCISES

http://mibiol.biol.lu.se.webbhotell.ldc.lu.se/Bioinformatics/Exercises/databases.html

http://wiki.bio.dtu.dk/teaching/index.php/Exercise:_Searching_the_GenBank_database

http://biocourse.sanbi.ac.za/wp-content/uploads/2013/02/Biological-Databases-Exercises.pdf