Introduction to Databases part 2

Shifra Ben-Dor, Irit Orr


And now, for the molecules and databases...

•  DNA

•  RNA

•  Protein

DNA sequences

•  Genes are encoded in genomic sequences.

•  Genes are transcribed into pre-mRNAs (including coding, intronic, and 5' and 3' untranslated regions).

•  Pre-mRNAs are spliced (introns removed) to give mRNAs, which are translated into proteins.

•  mRNAs are copied to cDNAs.
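The splicing and translation steps above can be sketched in a few lines of Python. This is only a toy illustration: the sequence, the exon coordinates, and the four-codon table are invented for the example, and the sequence is written in the DNA alphabet (T rather than U), as sequence databases usually store it.

```python
# Minimal codon table covering only the codons used in this toy example
# (assignments follow the standard genetic code; "*" marks a stop codon).
CODON_TABLE = {"ATG": "M", "GCT": "A", "AAA": "K", "TAA": "*"}


def splice(pre_mrna, exons):
    """Join exon segments (0-based, end-exclusive) and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon.

    Raises KeyError for codons missing from the toy table.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical pre-mRNA: exon 1, one intron, exon 2.
pre_mrna = "ATGGCT" + "GTAAGTTTTCAG" + "AAATAA"
exons = [(0, 6), (18, 24)]

mrna = splice(pre_mrna, exons)
protein = translate(mrna)
print(mrna, protein)
```

A real transcript also carries 5' and 3' untranslated regions, so the coding sequence would first have to be located between the ATG and the stop codon.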

[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, PolyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop, and PolyA site. mRNA: Cap, 5' UTR, CDS (ATG to Stop, exons 1-4), 3' UTR, PolyA.]

Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.

International DNA databases

•  GenBank at NCBI: http://www.ncbi.nlm.nih.gov/

•  EMBL at EBI: http://www.ebi.ac.uk/embl/

•  DDBJ in Japan: http://www.ddbj.nig.ac.jp/

Data sources for DNA databases

•  Direct scientist submission

•  Genome sequencing labs and groups

•  Scientific literature

•  Patent applications

•  EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.

International DNA databases

All of these databases:

•  Have official releases every 2-3 months.

•  Have weekly (or daily) updates.

•  Are divided into sublibraries for easier searching.

DNA database divisions

•  PRI - primate (human, monkey)
•  ROD - rodent (mouse, rat)
•  MAM - other mammalian (bovine, cat)
•  VRT - other vertebrate (chicken)
•  INV - invertebrate
•  PLN - plant, fungal, and algal
•  BCT - bacteria
•  VRL - viruses
•  PHG - bacteriophage
•  SYN - synthetic (plasmids, vectors)
•  UNA - unannotated sequences
•  PAT - patent sequences
•  EST - Expressed Sequence Tags
•  STS - Sequence Tagged Sites
•  GSS - Genome Survey Sequences
•  HTG - High Throughput Genomic Sequences
•  HTC - High Throughput cDNA Sequences

Short Read and Trace Archives

The output of large-scale sequencing projects and next-generation sequencing is stored in separate databases. NCBI is phasing out the SRA, but the data will be available in GEO, the database for microarray results.

Genomic databases

•  Specialized resources that are:
– Species specific
– Sequencing technique specific

•  Display whole chromosomes (not a specific sequence).

Sources of mRNAs

•  Experimental
– Clone new gene
– Clone gene from database
– 2-hybrid system, RNA-Seq...

•  Database
– "Typical" cDNA
– Full-length cDNA
– EST

[Figure: an mRNA (5' mG cap, AAAA tail) aligned with a full-length cDNA and a typical cDNA; labels mark the TTTT and AAAA primers.]

RefSeq at NCBI (Reference Sequence database)

Definition

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq from NCBI

•  Non-redundancy

•  Explicitly linked nucleotide and protein sequences

•  Updates to reflect current knowledge of sequence data and biology

•  Data validation and format consistency

•  Distinct accession series

•  Ongoing curation by NCBI staff and collaborators, with reviewed records indicated

RefSeq record status

•  The RefSeq COMMENT block indicates the status of the record and the GenBank sequence data that was used to provide the record.

•  In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.

The level of curation may differ between different collaborating groups.

RefSeq

Status codes: RefSeq records are provided with a status code, which indicates the level of review a RefSeq record has undergone.

•  Reviewed*
•  Provisional
•  Predicted
•  Genome Annotation
•  Validated*
•  Model
•  Inferred
•  WGS

*Curated

STATUS: Definition

REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.

VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.

PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.

STATUS: Definition

PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.

INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.

MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.

STATUS: Definition

GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.

WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.

Accession Format    Molecule Type

NC_123456           Complete genome, complete chromosome, complete sequence
NG_123456           Genomic region
NM_123456           mRNA
NR_123456           Non-coding RNA
NP_123456           Protein
NT_123456           Genomic contig (from BACs)
NW_123456           Genomic contig (from WGS)
XM_123456           mRNA (taken from genomic seq)
XR_123456           RNA (taken from genomic seq)
XP_123456           Protein (taken from genomic seq)
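The accession table can be turned into a small lookup, which is handy when sorting a mixed list of identifiers. The sketch below is illustrative only: the function name is made up, and the optional ".version" suffix it accepts (as in NM_123456.2) is an addition not shown in the table.

```python
import re

# Prefix-to-molecule mapping copied from the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome, or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "protein (taken from genomic seq)",
}

# Two upper-case letters, an underscore, digits, and an optional version.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None  # not in the PREFIX_number format (e.g. a GenBank ID)
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_123456", "XP_123456.2", "BC012345"]:
    print(acc, "->", refseq_molecule_type(acc))
```

Accessions without the two-letter prefix and underscore, such as the hypothetical GenBank-style BC012345, return None, which is one quick way to tell RefSeq records from archival ones.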

What is the difference between RefSeq and GenBank?

GenBank:

•  Is an archival database that includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects.

•  Accession numbers are assigned to these submitted sequences.

•  Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.

•  As an archival database, GenBank is very redundant for some loci.

•  Sequence records are owned by the original submitter and cannot be altered by a third party.

What is the difference between RefSeq and GenBank?

RefSeq:

•  Sequences are derived from GenBank and provide non-redundant curated data.

•  Records represent current knowledge.

•  RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.

•  Some records include additional sequence information that was never submitted to an archival database but is available in the literature.

•  Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.

•  RefSeq sequences are not submitted primary sequences.

Various High Throughput Collections: Nedo, DKFZ, HRI, Genoscope

•  Full-length cDNA libraries from various tissues were subtracted and normalized to reduce redundancy.

•  Clones were end-sequenced to further reduce redundancy.

•  Whole inserts were sequenced to get mRNA sequences.

•  [KIAA, done by Kazusa, was a project for long cDNAs (over 4 kb), which may not be full-length]

MGC - Mammalian Gene Collection

The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human, mouse, rat and cow gene. Zebrafish and Xenopus have their own projects (ZGC and XGC).

MGC produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.

5' EST reads were generated from each library. Several algorithms were applied to select putative full ORF clones. Targeted cloning or synthesis was used to finish.

Sources of mRNAs                      Accession Numbers

•  Individual labs                    various
•  RefSeq                             XX_123456

Full-length sequencing projects:

•  Riken, Nedo (FLJ), HRI             AK, CR
   DKFZ, Genoscope, [KIAA]...         [AB, D]
•  MGC                                BC, CT

Sources of mRNAs

•  Experimental
– Clone new gene
– Clone gene from database
– 2-hybrid system

•  Database
– "Typical" cDNA
– Full-length cDNA
– EST

RNA, cDNA, and ESTs

[Figure: an mRNA with exons 1-3, the cDNA and cDNA clone copied from it, and two ESTs derived from the clone.]

Adapted with permission from Adam Sartiel

Uses of ESTs

- Prediction of coding regions
- Detection of alternative splicing
- Clustering to form "genes"

Problems with clustering:
- Incomplete coverage breaks genes up
- Gene families
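The first clustering problem can be seen in a toy model. The sketch below assumes each EST has already been placed at (start, end) coordinates on a transcript and clusters ESTs that overlap; real EST clustering compares the sequences themselves, so this is a simplification, and the coordinates are invented.

```python
def cluster_ests(ests):
    """Group (start, end) intervals into clusters of overlapping ESTs.

    Returns one (start, end) span per cluster, sorted by position.
    """
    clusters = []
    for start, end in sorted(ests):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster: extend its right edge.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # No overlap with anything seen so far: start a new cluster.
            clusters.append([start, end])
    return [tuple(cluster) for cluster in clusters]


# Hypothetical ESTs from one 2.1 kb transcript, with nothing covering
# positions 700-1500.
ests = [(0, 400), (300, 700), (1500, 1900), (1700, 2100)]
split = cluster_ests(ests)

# One more EST bridging the gap joins the two clusters into one "gene".
joined = cluster_ests(ests + [(600, 1600)])
print(split, joined)
```

With the gap uncovered, the single transcript is reported as two separate clusters, which is how incomplete coverage breaks a gene up.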

Problems with ESTs

- Low copy number genes
- Rare tissues
- Mistakes
- Enrichment of 3' ends of genes
- Incomplete coverage of genes


With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.

Entrez Gene at NCBI

Entrez Gene - a database for gene-specific information.

It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.

The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.

Entrez Gene at NCBI

The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.

Entrez Gene data is used by other NCBI resources such as BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline.

Data reliability in databases

The huge amount of data collected in databases presents a lot of problems:

– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)

Data reliability in databases

•  The database staff notify the authors that an error (or contamination) was detected in their sequence entry.

•  However, it takes time to correct the data.

•  Meanwhile the error is carried forward, because many of the proteins in the protein database are translated from the DNA sequence database.

Data reliability in databases

•  Many of the sequences in the database are quite "old". They have not been updated since they were submitted, even though the technology and the data have advanced considerably.

Gene symbols

Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numerals.

Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.

Ideally symbols should be no longer than six characters in length.

Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
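The two mechanical rules above (upper-case letters with optional digits, ideally six characters or fewer) are easy to check automatically. The sketch below encodes only those two rules; the requirement that a symbol start with a letter is an added assumption, and approved symbols may follow conventions this check does not cover.

```python
import re

# Upper-case Latin letters, optionally mixed with Arabic digits.
# Starting with a letter is an assumption made for this sketch.
_SYMBOL = re.compile(r"^[A-Z][A-Z0-9]*$")


def check_symbol(symbol, preferred_max=6):
    """Return (follows_format, within_preferred_length) for a gene symbol."""
    follows_format = bool(_SYMBOL.match(symbol))
    within_length = len(symbol) <= preferred_max
    return follows_format, within_length


for symbol in ["ABCA1", "nrg1", "ABCDEFG1"]:
    print(symbol, check_symbol(symbol))
```

The length rule is reported separately because the guideline says "ideally", so a longer symbol is a warning rather than an error.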

HUGO Gene Nomenclature Committee

•  This committee is responsible for the approval of a unique symbol for each gene.

•  It also assigns a longer and more descriptive name.

•  The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.

•  However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl).

Gene Symbols

Symbol                  ABCA1
Full name               ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic location    9q31
MIM number              600046
PubMed ID               8088782

Taxonomy Databases

•  An international effort is under way across all sequence databases to create a unified taxonomic tag for the sequences submitted.

•  Problem: each sequence depositor gives their own name for the species.

•  Solution: a unified taxonomy ID.
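A unified ID works like a lookup key that many depositor names map onto. The sketch below uses a small hand-made table; 9606 and 10090 are the NCBI Taxonomy IDs for human and mouse, while the table itself, the list of synonyms, and the function name are invented for illustration.

```python
# Hypothetical synonym table: several depositor names, one taxonomy ID.
NAME_TO_TAXID = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "house mouse": 10090,
}


def taxid_for(name):
    """Normalize case and whitespace, then look up the taxonomy ID."""
    key = " ".join(name.lower().split())
    return NAME_TO_TAXID.get(key)  # None when the name is not in the table


for name in ["Human", "Homo  sapiens", "house mouse", "unknown organism"]:
    print(repr(name), "->", taxid_for(name))
```

Once every record carries the ID, a search by 9606 finds all human entries no matter which name the depositor typed.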

Protein databases

•  There are many different protein databases containing different types of information:
– Primary amino acid sequence
– Secondary structure
– 3D structure
– Protein family domains
– Consensus active sites

Sources of Protein

•  Proteins that have been worked on experimentally

•  mRNA whose product has been worked on experimentally (no actual protein sequencing done)

•  Translated DNA (mRNA) sequences

Protein Primary Sequence Databases

•  Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.

•  Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.


UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

•  The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-reference.

•  The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.

•  The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

Swiss-Prot Database (primary database)

•  Swiss-Prot annotation includes:
– Description of protein function
– Protein domain structure
– Post-translational modifications
– Protein variants

•  Sequence entries are composed of different line types, each with its own format. For standardization purposes, the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) database.

Swiss-Prot Database

Swiss-Prot differs from other protein databases by the following criteria:

•  Annotation

•  Minimal redundancy

•  Integration with other databases

Swiss-Prot Database

Annotation

In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

The core data consists of the sequence, the citation information (bibliographical references), and the taxonomic data (description of the biological source of the protein).

The annotation consists of the description of:

•  Function(s) of the protein

•  Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.

•  Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.

•  Secondary structure

The annotation consists of the description of:

•  Quaternary structure. For example homodimer, heterotrimer, etc.

•  Similarities to other proteins

•  Disease(s) associated with deficiencies in the protein

•  Sequence conflicts, variants, etc.

Swiss-Prot Database

To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.

Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.

Swiss-Prot Database

Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. Swiss-Prot tries as much as possible to merge all these data so as to minimize the redundancy of the database.

If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
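The merge-and-flag idea can be sketched as follows. This is not how Swiss-Prot is actually built; it is a minimal illustration that assumes each literature report is a dict with hypothetical 'protein', 'sequence', and 'reference' keys, and it only compares reports whose sequences have the same length.

```python
from collections import defaultdict


def merge_reports(reports):
    """Merge per-publication reports into one entry per protein.

    The first report's sequence is kept as the displayed sequence.
    Positions (1-based) where a later report disagrees are recorded as
    conflicts instead of creating a separate entry.
    """
    grouped = defaultdict(list)
    for report in reports:
        grouped[report["protein"]].append(report)

    merged = {}
    for protein, group in grouped.items():
        base = group[0]["sequence"]
        conflicts = []
        for other in group[1:]:
            if len(other["sequence"]) != len(base):
                continue  # length differences are out of scope here
            pairs = zip(base, other["sequence"])
            for pos, (kept, seen) in enumerate(pairs, start=1):
                if kept != seen:
                    conflicts.append((pos, kept, seen, other["reference"]))
        merged[protein] = {
            "sequence": base,
            "references": [r["reference"] for r in group],
            "conflicts": conflicts,
        }
    return merged


# Three invented reports of the same protein; the third disagrees at
# position 2.
reports = [
    {"protein": "P1", "sequence": "MAKT", "reference": "report A"},
    {"protein": "P1", "sequence": "MAKT", "reference": "report B"},
    {"protein": "P1", "sequence": "MGKT", "reference": "report C"},
]
entries = merge_reports(reports)
print(entries["P1"])
```

Three reports become one entry with three references and a single conflict note, rather than three near-identical records.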

Swiss-Prot Database

Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

TrEMBL database

•  TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.

•  TrEMBL contains entries not yet integrated in Swiss-Prot.

•  Combines information not in other databases, like microarray data, population variation studies, proteomics

•  Powerful querying options

•  Only for human proteins

NR database (primary databases from NCBI)

•  The NR protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to PIR, Swiss-Prot, PRF, and PDB (sequences from solved structures).

Data reliability in Protein databases

•  About 30% of the proteins in the databases have erroneous sequences due to:
– Missing exons in the DNA translation
– Introns mistakenly translated

•  Another common problem is the assigning of functions to "new" proteins, based on sequence similarity.

Data reliability in Protein databases

•  For example:
– Protein A is similar to protein B.
– Protein B annotation is based on Protein A annotation (which has an error).
– Annotation of Protein A is corrected by the group working on it. This correction is not reflected in the Protein B annotation.
– When Proteins C and D are also based on the erroneous annotation of B, the problem ...
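The A, B, C, D scenario can be played out in a few lines. The sketch below is a toy model: the annotation strings and the `derived_from` mapping are invented, and copying is done once, at the time each record is created, which is the situation the example describes.

```python
def propagate(annotations, derived_from):
    """Copy each source protein's annotation onto the proteins based on it.

    `derived_from` maps a protein to the protein its annotation was copied
    from. Entries are processed in insertion order, so a source must be
    listed before the proteins that depend on it.
    """
    for target, source in derived_from.items():
        annotations[target] = annotations[source]
    return annotations


# Protein A starts with an erroneous annotation (invented text).
annotations = {"A": "kinase (erroneous)"}
derived_from = {"B": "A", "C": "B", "D": "B"}
propagate(annotations, derived_from)

# The group working on A corrects A's record only.
annotations["A"] = "phosphatase (corrected)"

# Records that still carry the old annotation.
stale = [p for p in derived_from if annotations[p] != annotations["A"]]
print(stale)
```

Nothing links B, C, and D back to A after the copy, so fixing A leaves three stale records behind.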

Text searching pitfalls

•  It finds exactly what you type (try pseudogene vs. psuedogene)

•  Older records may have different annotation, from gene names on...

•  human vs homo sapiens

•  Gene symbols vs full gene name (for example neuregulin vs nrg1)
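These pitfalls are easy to reproduce with a plain substring search. The three record titles below are invented for the demonstration, and the search function is a deliberately naive stand-in for a database text search, not the behavior of any particular site.

```python
# Invented record titles, including one with a misspelling.
records = [
    "Homo sapiens neuregulin 1 (NRG1), mRNA",
    "human NRG1 psuedogene fragment",
    "Mus musculus pseudogene, partial sequence",
]


def text_search(records, term):
    """Case-insensitive substring match: no spelling or synonym handling."""
    term = term.lower()
    return [record for record in records if term in record.lower()]


for term in ["pseudogene", "psuedogene", "human", "neuregulin", "nrg1"]:
    print(term, "->", len(text_search(records, term)), "hit(s)")
```

Each spelling of "pseudogene" finds only the record that uses it, "human" misses the record labeled "Homo sapiens", and "neuregulin" misses the record that gives only the symbol.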

•  Most sites use boolean operators (AND, OR, BUT NOT).

•  You can use (or add) a field-specific tag, but each site has a different way of adding it to a search. For example, NCBI uses square brackets [].
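Boolean operators and field tags can be combined to cover several of the pitfalls in one query. The sketch below only builds a query string in the NCBI square-bracket style; the helper names are made up, and the tag names used (Gene Name, Title, Organism) are assumptions that should be checked against the help pages of the database being searched.

```python
def field(term, tag):
    """Attach an NCBI-style field tag in square brackets to a search term."""
    return f"{term}[{tag}]"


def combine(operator, *terms):
    """Join terms with a boolean operator and wrap them in parentheses."""
    return "(" + f" {operator} ".join(terms) + ")"


# Search by symbol or full name, and by either way of naming the species.
query = combine(
    "AND",
    combine("OR", field("nrg1", "Gene Name"), field("neuregulin", "Title")),
    combine(
        "OR",
        field("human", "Organism"),
        field('"homo sapiens"', "Organism"),
    ),
)
print(query)
```

The parentheses make the grouping explicit, so the query does not depend on how a given site orders AND and OR.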

Remember:

Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!