BioinfRes SoSe 16
Bioinformatics Resources - GenBank -
Lecture & Exercises: Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12
National Center for Biotechnology Information, NCBI
Image: http://nihrecord.nih.gov/newsletters/2013/07_19_2013/images/milestonesPic6.jpg
● first ideas in the mid-1980s
● division of the National Library of Medicine (NLM) within the National Institutes of Health (NIH)
● political mission
● founded in 1988
● David Lipman
NCBI's political mission as defined by the bill:
1. design, develop, implement, and manage automated systems for the collection, storage, retrieval, analysis, and dissemination of knowledge concerning human molecular biology, biochemistry, and genetics;
2. perform research into advanced methods of computer-based information processing capable of representing and analyzing the vast number of biologically important molecules and compounds;
3. enable persons engaged in biotechnology research and medical care to use systems developed under paragraph (1) and methods described in paragraph (2); and
4. coordinate, as much as is practicable, efforts to gather biotechnology information on an international basis.
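Selected NCBI Accomplishments
[Figure: timeline of NCBI milestones in two rows.
Row 1, labels: BLAST, GenBank at NCBI, NCBI website, Genomes, OMIM, PubMed; year ticks: 1990, 1992, 1994, 1995, 1996, 1997.
Row 2, labels: Human Genome, PubMed Central, Entrez Gene / DTDs, NIH Public Access, Genome Reference Consortium, 1000 Genomes Project; year ticks: 1999, 2000, 2003, 2005, 2007, 2008.]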
NCBI Resources
● NCBI currently hosts a vast range of resources: http://www.ncbi.nlm.nih.gov/guide/all/
● grouped according to various criteria:
- metadata, project-centric
- method-oriented
- topic-oriented
● sorted into the sections: databases, downloads, submissions, tools, how-tos
GenBank's Origin
● Walter Goad, Los Alamos National Laboratory
● Los Alamos Sequence Database, 1979
● creation and release of GenBank in 1982
● end of 1982: 2000 sequences
● move to NCBI in 1992
Image: http://www.lanl.gov/science-innovation/features/innovations/images/light/thumbnails/21.jpg
Minutes from the 20th anniversary of GenBank in 2002
“.... Among them is a memo on Los Alamos National Laboratory stationery dated May 9, 1980, that reads: Monday, May 12 at 10:30 Steve Simon invites you for cake and coffee to celebrate 100,000 bases now in the DNA sequence library.”
taken from https://www.genomeweb.com/genbank-turns-20
Growth of GenBank and WGS
- doubling approx. every 18 months; diagram for release 207, Apr. 2015
- current version: release 213, Apr. 2016: 211,423,912,047 bases in GenBank, 1,452,207,704,949 bases in WGS
- taken from http://www.ncbi.nlm.nih.gov/genbank/statistics
Growth of GenBank and WGS
- current release 213: 193,739,511 sequences in GenBank, 338,922,537 sequences in WGS
- diagram taken from http://www.ncbi.nlm.nih.gov/genbank/statistics, release 207, Apr. 2015
References for GenBank
● the current citation source: “GenBank”. Nucleic Acids Res. 2014 Jan;42(Database issue):D32-7. doi:10.1093/nar/gkt1030. Epub 2013 Nov 11.
● PMID: 24217914
● part of the International Nucleotide Sequence Database Collaboration (INSDC), together with the EMBL Nucleotide Sequence Database (EMBL-Bank), which is part of the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ)
Fastest-Growing Divisions
Division  Description                 Base pairs, Release 197 (8/2013)  Annual increase (%)
WGS*      Whole-genome shotgun data   500,420,412,665                   62.4
TSA*      Transcriptome shotgun data  8,633,123,935                     49.9
PHG       Phages                      119,812,712                       42.5
VRL       Viruses                     1,757,202,472                     22.9
BCT       Bacteria                    10,281,048,518                    21.8
ENV       Environmental samples       3,743,277,434                     10.9
INV       Invertebrates               2,737,140,464                     9.8
PAT       Patented sequences          13,290,161,247                    9.7
PLN       Plants                      5,963,882,822                     8.8
GSS       Genome survey sequences     23,726,384,753                    8.1
VRT       Other vertebrates           3,068,956,026                     6.3
MAM       Other mammals               911,342,025                       5.6
...       ...                         ...                               ...
TOTAL     All GenBank sequences       654,613,333,676                   45.1
* not distributed with the release; there are specific project server sections
Top Organisms (Rel. 207)
Organism                     Entries     Non-WGS base pairs
Homo sapiens                 20,921,637  17,714,786,437
Mus musculus                 9,727,522   9,995,696,539
Rattus norvegicus            2,193,812   6,526,236,496
Bos taurus                   2,227,298   5,410,360,312
Zea mays                     4,177,175   5,201,714,457
Sus scrofa                   3,297,029   4,895,127,638
Danio rerio                  1,727,668   3,133,901,682
Triticum aestivum            1,796,780   1,927,718,314
...                          ...         ...
Oryza sativa Japonica Group  1,376,410   1,265,556,227
...                          ...         ...
Arabidopsis thaliana         2,578,785   1,202,100,008
...                          ...         ...
Distribution of Sequence Files (Rel. 207)
Division  Number of files
BCT       178
CON       317
ENV       81
EST       478
HTG       142
INV       126
PAT       219
PLN       107
TSA       175
VRL       34
Release 207 consists of 2333 text files in total.
Database Files
● GenBank comes as a set of compressed text files available via FTP
● see ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
● 2333 ASCII files (listed by division, plus additional list files) ranging from 0.7 to 520 MB
● uncompressed: ~709 GB
● each file consists of two portions
Database Files
● Part 1: highly conserved database file header
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
GBBCT1.SEQ          Genetic Sequence Data Bank
                          April 15 2015
                NCBI-GenBank Flat File Release 207.0
                     Bacterial Sequences (Part 1)
   51396 loci,    92682287 bases, from    51396 reported sequences
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
● Part 2: sequence entries for the division described in the header
The GenBank Flat File Format
● a sequence entry consists of many records (lines)
● each record consists of two parts
● Part 1: columns 1-10 / entry field name
● Part 2: remainder of the line with the content
Part 1 (1/3)
● a keyword, beginning in column 1 of the record (e.g., REFERENCE is a keyword)
● a subkeyword beginning in column 3, with columns 1 and 2 blank (e.g., AUTHORS is a subkeyword of REFERENCE)
● or a subkeyword beginning in column 4, with columns 1, 2, and 3 blank (e.g., PUBMED is a subkeyword of REFERENCE)
Part 1 (2/3)
● blank characters, indicating that this record is a continuation of the information under the keyword or subkeyword above it
● a code, beginning in column 6, indicating the nature of an entry (feature key) in the FEATURES table
Part 1 (3/3)
● a number, ending in column 9 of the record:
- this number occurs in the portion of the entry describing the actual nucleotide sequence and designates the numbering of sequence positions
● two slashes (//) in positions 1 and 2, marking the end of an entry
Part 2
● the second part of each sequence entry record contains the information appropriate to its keyword
● in positions 13 to 80 for keywords
● in positions 11 to 80 for the sequence
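The column rules for Part 1 can be turned into a small classifier. The following is a minimal sketch, not an official parser: the function name `classify_record` and the returned labels are assumptions made for illustration, and only the rules listed on the slides are covered.

```python
def classify_record(line: str) -> str:
    """Classify one flat-file record by the contents of columns 1-10."""
    if line.startswith("//"):
        return "terminator"              # two slashes in positions 1 and 2
    head = line[:10].ljust(10)           # columns 1-10, padded for short lines
    if not head.strip():
        return "continuation"            # blank Part 1
    if head[:9].strip().isdigit() and head[8].isdigit():
        return "sequence"                # position number ending in column 9
    if head[0] != " ":
        return "keyword"                 # starts in column 1
    if head[:2] == "  " and head[2] != " ":
        return "subkeyword"              # starts in column 3
    if head[:3] == "   " and head[3] != " ":
        return "subkeyword"              # starts in column 4
    if head[:5] == "     " and head[5] != " ":
        return "feature_key"             # starts in column 6
    return "unknown"


print(classify_record("  AUTHORS   Doe,J."))
print(classify_record("     CDS             687..3158"))
```

The sequence check runs before the subkeyword checks because a long position number, right-justified to column 9, may also start in column 4.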
Entry Field Types (incomplete)
● Locus: a short mnemonic name for the entry, chosen to suggest the sequence's definition; mandatory keyword / exactly one record
● Definition: a concise description of the sequence; mandatory keyword / one or more records
● Accession:
- the primary accession number is a unique, unchanging identifier assigned to each GenBank sequence record
- to be used for citations from GenBank
- mandatory keyword / one or more records
● Version:
- compound identifier consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record
- optionally followed by an integer identifier (a "GI") assigned to the sequence by NCBI
- mandatory keyword / exactly one record
● DBLINK: provides cross-references to resources that support the existence of a sequence record; optional keyword / one or more records
● Keywords: short phrases describing gene products and other information about an entry; mandatory keyword in all annotated entries / one or more records
● Source: common name of the organism or the name most frequently used in the literature; mandatory keyword in all annotated entries / one or more records / includes one subkeyword
● Organism: formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines); mandatory subkeyword in all annotated entries / two or more records
Entry Field Types (incomplete)
● Reference:
- citations for all articles containing data reported in this entry
- includes seven subkeywords and may repeat
- mandatory keyword / one or more records
● Journal: lists the journal name, volume, year, and page numbers of the citation; mandatory subkeyword / one or more records
● optional subkeywords: Authors, Consortium, Title, Medline, Pubmed, Remark
● Features: table containing information on portions of the sequence that code for proteins and RNA molecules, and on sites of biological significance; optional keyword / one or more records
● Origin:
- specification of how the first base of the reported sequence is operationally located within the genome
- mandatory keyword / exactly one record
- followed by sequence data (multiple records)
● //: entry termination symbol; mandatory at the end of an entry / exactly one record
Detailed Locus Format
Columns  Contents
01-05    'LOCUS'
06-12    spaces
13-28    Locus name
29-29    space
30-40    Length of sequence, right-justified
41-41    space
42-43    bp
44-44    space
45-47    spaces, ss- (single-stranded), ds- (double-stranded), or ms- (mixed-stranded)
48-53    NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), mRNA (messenger RNA), uRNA (small nuclear RNA), left-justified
54-55    spaces
56-63    'linear' followed by two spaces, or 'circular'
64-64    space
65-67    The division code
68-68    space
69-79    Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
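Because the LOCUS record is column-based, it can be read with plain string slicing. The sketch below assumes a record that follows the column table exactly. The helper `parse_locus`, the dictionary keys, and the sample values are illustrative choices; real files may contain longer locus names that break strict column positions.

```python
def parse_locus(line: str) -> dict:
    """Slice a LOCUS record by the fixed columns (1-based, inclusive)."""
    def col(start: int, end: int) -> str:
        return line[start - 1:end].strip()

    if col(1, 5) != "LOCUS":
        raise ValueError("not a LOCUS record")
    return {
        "name": col(13, 28),
        "length": int(col(30, 40)),
        "strandedness": col(45, 47),   # empty if the columns are blank
        "molecule": col(48, 53),
        "topology": col(56, 63),
        "division": col(65, 67),
        "date": col(69, 79),
    }


# Build a sample record column by column (values are illustrative).
sample = (
    "LOCUS" + " " * 7              # columns 01-12
    + "SCU49845".ljust(16)         # columns 13-28
    + " " + "5028".rjust(11)       # columns 29-40
    + " bp "                       # columns 41-44
    + "   "                        # columns 45-47, strandedness left blank
    + "DNA".ljust(6)               # columns 48-53
    + "  " + "linear".ljust(8)     # columns 54-63
    + " " + "PLN"                  # columns 64-67
    + " " + "21-JUN-1999"          # columns 68-79
)
print(parse_locus(sample))
```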
Accession Format
● six or eight characters
● six-character format:
- single uppercase letter
- 5 digits
● eight-character format:
- two uppercase letters
- 6 digits
● the primary accession number is always the first one
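The two accession formats translate directly into a regular expression. This is a minimal sketch covering only the two formats named on the slide; the names `ACCESSION_RE` and `is_genbank_accession` are made up for this example, and other accession styles (for instance RefSeq accessions with an underscore) are deliberately rejected.

```python
import re

# One uppercase letter + 5 digits, or two uppercase letters + 6 digits.
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def is_genbank_accession(text: str) -> bool:
    """Return True if text matches one of the two formats above."""
    return ACCESSION_RE.fullmatch(text) is not None


print(is_genbank_accession("U49845"))     # six-character format
print(is_genbank_accession("AF123456"))   # eight-character format
print(is_genbank_accession("NM_000546"))  # contains "_", not matched
```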
Features (incomplete)
● authoritative source: http://www.insdc.org/documents/feature-table
● the feature table:
- describes genes and gene products
- describes regions of biological significance
- can enumerate differences between various reports
- provides cross-references to other data collections
- allows hierarchical relations between the features
Layout
● the first line of the feature table is a header
● it includes the keyword ‘FEATURES’ and the column header ‘Location/Qualifiers’
● each feature consists of:
- a descriptor line containing a feature key and a location
- a continuation line for the location may follow
- feature qualifiers may follow the descriptor line
- key: columns 6-20, location starts in column 22
- qualifiers on subsequent lines at column 22, starting with a ‘/’
A Few Frequent Features
● CDS: sequence coding for amino acids in protein (includes stop codon)
● exon: region that codes for part of spliced mRNA
● gene: region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
● mRNA: messenger RNA
● ... > 60 features currently
Location and Qualifiers
● Location:
- a location can be: a single base, a span of bases, a site between two bases, a join of sequences, ...
- examples: 23, 23..56, 23^24, join(23..56,87..110)
● Qualifiers:
- format: from column 22, /qualifier_name[=value]
- types: free text, enumeration or controlled vocabulary, citations, sequences, feature labels
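The four location examples can be parsed with a few lines of code. This is a sketch restricted to the forms shown above (single base, span, site between two bases, flat join). The function name and the tuple layout are assumptions, and operators not listed on the slide are not handled.

```python
import re


def parse_location(text: str):
    """Parse the simple location forms: 23, 23..56, 23^24, join(...)."""
    text = text.strip()
    if text.startswith("join(") and text.endswith(")"):
        # Flat join only: elements are split on commas, no nesting.
        parts = text[len("join("):-1].split(",")
        return ("join", [parse_location(part) for part in parts])
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match:
        return ("span", int(match.group(1)), int(match.group(2)))
    match = re.fullmatch(r"(\d+)\^(\d+)", text)
    if match:
        return ("between", int(match.group(1)), int(match.group(2)))
    if re.fullmatch(r"\d+", text):
        return ("base", int(text))
    raise ValueError(f"unsupported location: {text!r}")


for example in ["23", "23..56", "23^24", "join(23..56,87..110)"]:
    print(example, "->", parse_location(example))
```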
Database Cross References: /db_xref
● http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/
● Qualifier: /db_xref="database:identifier"
● Definition: database cross-reference: pointer to related information in another database
● Scope: all feature keys
● Example: /db_xref="Swiss-Prot:P12345"
● currently > 120 databases available
Anatomy of a GenBank Flat File
[Screenshots of a sample flat file, highlighting in turn: the locus line; the accession number, version and GI number; the feature table with annotations]
Useful Resources from NCBI
● Materials:
● electronic bookshelf
● http://www.ncbi.nlm.nih.gov/education/factsheets/
● ftp://ftp.ncbi.nih.gov/pub/factsheets/Factsheet_Books.pdf
● NCBI manuals
● textbooks
Useful Resources from NCBI
● Processes, e.g. the Prokaryotic Genome Annotation Pipeline
● designed for bacterial and archaeal genomes
● multi-level process including prediction of protein-coding genes and of functional genome units such as rRNAs, tRNAs, small RNAs, pseudogenes, control regions, repeats, insertion elements, and so forth
● combination of ab initio prediction and homology-based methods
Useful Resources from NCBI
● reference databases: RefSeq
● http://www.ncbi.nlm.nih.gov/refseq/
● comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins
● stable reference for genome annotation, esp. the RefSeqGene subset
● reference sequences
● reference coordinates
● accessible via BLAST, Entrez and FTP
RefSeq
● created by:
- Eukaryotic Genome Annotation Pipeline
- Prokaryotic Genome Annotation Pipeline
- manual curation
- submission to INSDC members
● reflects current knowledge of sequence data and biology
● format consistency
● accession number contains an “_”
RefSeq Growth
Databases Accessible via Entrez
http://www.ncbi.nlm.nih.gov/gquery/
Computation: BLAST at NCBI
Searching the NCBI / Entrez
● the Entrez Programming Utilities (E-utilities) provide an integrated search interface to the different NCBI databases
● base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
● > 40 databases
● stable interface of nine server-side programs
● http://www.ncbi.nlm.nih.gov/books/NBK25501/
Entrez Guidelines
● if you use the E-utilities against the guidelines, you might be banned!
● > 100 requests: run them on weekends or outside US peak times (9 pm - 5 am, EST)
● no more than 3 requests per second
● provide email and tool name: &tool= and &email=
● registration of email and tool name with NCBI may relax these restrictions
● supported by Biopython
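The limit of three requests per second can be respected with a simple client-side throttle. The class below is a sketch, not an NCBI-provided mechanism: `Throttle` and its parameters are invented for this example, and it simply spaces calls at least one third of a second apart. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None          # time of the previous call, if any

    def wait(self):
        """Block until the next request may be sent."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


throttle = Throttle(max_per_second=3)
for _ in range(3):
    throttle.wait()                # call this before each E-utility request
```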
Constructing URLs
● parameters: &lowercasename
● exception: &WebEnv
● no required order
● null values and inappropriate parameters are generally ignored
● no spaces, use + instead
● use URL encodings for special characters, e.g. %22 for ", %23 for #, %40 for @
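The encoding rules above are what Python's standard `urllib.parse.urlencode` produces by default, so URLs do not have to be escaped by hand. In this sketch the helper `build_url`, the tool name, and the email address are placeholders; the base URL is the one given on the slides.

```python
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility: str, **params: str) -> str:
    """Return the base URL + utility + '?' + URL-encoded parameters."""
    return BASE_URL + utility + "?" + urlencode(params)


url = build_url(
    "esearch.fcgi",
    db="pubmed",
    term='"breast cancer"',        # quotes become %22, the space becomes +
    tool="my_tool",                # placeholder tool name
    email="me@example.org",        # placeholder address, @ becomes %40
)
print(url)
```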
E-utilities
● EInfo
● ESearch
● EPost
● ESummary
● EFetch
● ELink
● EGQuery
● ESpell
● ECitMatch
ESearch
● text search
● eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
● responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query
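An ESearch response in XML can be reduced to the UID list with the standard library. The sketch below parses a hand-written sample rather than a live response: the UIDs and the query translation are made up, and the element names (`Count`, `IdList`, `Id`, `QueryTranslation`) are assumed to follow ESearch's XML output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; the UIDs are not real results.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
    <Id>333</Id>
  </IdList>
  <QueryTranslation>example[All Fields]</QueryTranslation>
</eSearchResult>"""


def parse_esearch(xml_text: str) -> dict:
    """Extract the hit count, the UID list and the query translation."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "ids": [element.text for element in root.findall("./IdList/Id")],
        "translation": root.findtext("QueryTranslation"),
    }


print(parse_esearch(SAMPLE_ESEARCH_XML))
```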
ESummary
● document summary downloads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
● responds to a list of UIDs from a given database with the corresponding document summaries
EGQuery
● global query
● eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi
● responds to a text query with the number of records matching the query in each Entrez database
EInfo
● database statistics
● eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
● provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases
● without &db: lists all available databases
EFetch
● data record downloads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
● responds to a list of UIDs in a given database with the corresponding data records in a specified format
ELink
● Entrez links
● eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
● responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database
ELink (continued)
● checks for the existence of a specified link from a list of one or more UIDs
● creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs
EPost
● UID uploads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
● accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset
ESpell
● spelling suggestions
● eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi
● retrieves spelling suggestions for a text query in a given database
ECitMatch
● batch citation searching in PubMed
● eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi
● retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings
Identifiers
● records are identified by an integer ID called UID
● UIDs are database-specific, e.g. GI numbers, PMIDs, MMDB-IDs
● UIDs serve as both input and output
● especially useful in combination with the History server
● a full description of parameters and syntax can be found at: http://www.ncbi.nlm.nih.gov/books/NBK25499/
Selected UIDs
Entrez Database    UID common name  E-utility Database Name
Books              Book ID          books
Conserved Domains  PSSM-ID          cdd
dbVar              dbVar ID         dbvar
EST                GI number        nucest
Gene               Gene ID          gene
Genome             Genome ID        genome
MeSH               MeSH ID          mesh
NCBI Web Site      Web Site ID      ncbisearch
Nucleotide         GI number        nuccore
PubMed             PMID             pubmed
...                ...              ...
Entrez Core Engine
● EGQuery, ESearch, and ESummary
● two tasks:
- assemble a list of UIDs that match a text query (ESearch)
- retrieve a brief summary record called a Document Summary (DocSum) for each UID (ESummary)
● EGQuery: global version of ESearch
● esearch.fcgi?db=database&term=query
● esummary.fcgi?db=database&id=uid1,uid2,uid3,...
● can be expanded into more complicated Entrez queries
Entrez Databases (EInfo, EFetch, and ELink)
● EInfo:
- provides detailed information about each database
- including lists of the indexing fields in the database
- and the available links to other Entrez databases
Entrez Databases (EInfo, EFetch, and ELink)
● added value to the raw data:
- supports a variety of display formats (EFetch): UID lists in XML and plain text (&retmode) for all databases; other formats (&rettype) are database-specific
- http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
- efetch.fcgi?db=database&id=uid1,uid2,uid3&rettype=report_type&retmode=data_mode
Entrez Databases (EInfo, EFetch, and ELink)
● added value to the raw data:
- links to records in other Entrez databases, manifested as lists of associated UIDs
- UIDs must be valid in the source database (&dbfrom)
- elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902
Entrez History Server
● simple: in the GUI, accessible via the respective tabs
● you can temporarily store sets of UIDs as input for later queries through other tools
● each list of UIDs is specified by:
- &query_key (integer label)
- &WebEnv (cookie string)
Creation of a Stored UID List
● EPost:
- EPost can be used to upload a UID list
- returns &query_key and &WebEnv
● ESearch:
- stores the results if given &usehistory=y
● ELink:
- stores the results if given &cmd=neighbor_history
Usage of Stored UID Lists
● use of stored lists: esummary.fcgi?db=database&WebEnv=webenv&query_key=key
● one web environment can hold multiple result lists
● lists in the same web environment can be combined with AND, OR, NOT
● by default, every call creates a new environment
● -> pass &WebEnv in subsequent calls to store the lists in the same web environment
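Reusing a stored list amounts to carrying two values, &WebEnv and &query_key, from one response into the next URL. The sketch below works on a hand-written ESearch response: the WebEnv string is a placeholder, the element names `WebEnv` and `QueryKey` are assumed to match the XML returned when &usehistory=y is set, and both helper functions are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hand-written sample; a real WebEnv is a long server-generated string.
SAMPLE_HISTORY_XML = (
    "<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>"
    "<WebEnv>HYPOTHETICAL_WEBENV</WebEnv></eSearchResult>"
)


def history_handle(esearch_xml: str):
    """Return (WebEnv, query_key) from an ESearch response."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("WebEnv"), root.findtext("QueryKey")


def esummary_from_history(db: str, webenv: str, query_key: str) -> str:
    """Build an ESummary URL that refers to a stored UID list."""
    params = {"db": db, "WebEnv": webenv, "query_key": query_key}
    return BASE_URL + "esummary.fcgi?" + urlencode(params)


webenv, key = history_handle(SAMPLE_HISTORY_XML)
print(esummary_from_history("pubmed", webenv, key))
```

Passing the same WebEnv value to later ESearch, EPost or ELink calls keeps their result lists in one web environment.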
Sketching Pipelines
● get DocSummaries or entries for keywords or IDs:
- ESearch -> ESummary/EFetch
- EPost -> ESummary/EFetch
● filter/limit a record set:
- EPost/ELink -> ESearch
● more advanced queries:
- ESearch -> ELink -> ESummary/EFetch
- EPost -> ELink -> ESearch -> EFetch
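The simplest pipeline, ESearch -> EFetch, can be sketched as two chained requests. This is an illustration under assumptions, not a reference client: `search_then_fetch` and `http_get` are made-up names, the UID list is passed directly instead of through the History server, and no throttling or tool/email parameters are included. The `get` argument can be replaced by a stub so the control flow can be tried without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url: str) -> str:
    """Fetch a URL and return the body as text (performs a real request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def search_then_fetch(db, term, rettype, retmode, get=http_get):
    """ESearch for term, then EFetch the matching records."""
    search_url = BASE_URL + "esearch.fcgi?" + urlencode({"db": db, "term": term})
    root = ET.fromstring(get(search_url))
    ids = [element.text for element in root.findall("./IdList/Id")]
    if not ids:
        return ""                  # nothing matched, skip the second call
    fetch_params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return get(BASE_URL + "efetch.fcgi?" + urlencode(fetch_params))
```

For larger result sets, the History server variant (ESearch with &usehistory=y, then EFetch with &WebEnv and &query_key) avoids sending long UID lists back and forth.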
E-utility Webinar
● https://www.youtube.com/watch?v=iCFVVexp30o