Databases searched with BLAST
Basic Bioinformatics Training 2017, ILRI, Addis Ababa, Ethiopia
Dec 11–15, 2017
Joyce Nzioki
Why database searches?
• Gene finding
• Assigning a likely function to a gene
• Identifying regulatory elements
• Understanding genome evolution
• Finding relations between genes
• Assisting in genome assembly
• Identifying domains with sequence, structural and functional similarity
Sequence comparison
• Newly sequenced DNA data is compared to data already available in biological databases.
• Sequence comparison (of DNA or protein data) is achieved through alignment, the process by which regions of similarity are searched for between sequences.
• This eases annotation of new sequences, because biological knowledge from well-characterized homologs can be transferred to them.
Some terminology
1. Sequence similarity: two sequences are very alike in base pair or amino acid sequence. It is measured by:
• Statistical measures such as the E-value, P-value and bit score
• Percentage identity (% of identical residues between sequences)
• The length of the sequence stretch that is similar
2. Homology: homologs diverged from a common ancestor; homology is inferred from sequence, structural and functional similarity.
• Orthologs arise from a speciation event
• Paralogs arise from gene duplication within a genome
Homology
1. Ortholog (orthologous genes): genes in different species that are similar to each other because they originate from a common ancestor; they usually keep the same function throughout evolution.
2. Paralog (paralogous genes): genes within a species that are similar to each other because they arose from a duplication event; they may evolve new functions over evolutionary time.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif
Sequence alignment
What is it?
ACCGGTATCCTAGGAC
ACCTATCTTAGGAC
Are these two sequences related? How similar or dissimilar are they?
ACCGGTATCCTAGGAC
|||  |||| ||||||
ACC--TATCTTAGGAC
Match the two sequences as closely as possible = aligned.
Alignments need a score.
Sequence alignment
There are two types of alignment strategy:
1. Global alignment: finds the optimal alignment over the entire length of the two compared sequences.
• Best for highly similar sequences
• Used to search for mutations and polymorphisms in a sequence compared to a reference
2. Local alignment: aligns short regions of similarity between sequences.
• Useful when looking for domains in proteins and for gene finding in DNA
• Best for sequences of low similarity and different lengths
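The difference between the two strategies can be sketched with the classic dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local alignment). This is only a minimal illustration and not what BLAST itself runs; the function name and the scores (match = 1, mismatch = -1, gap = -2) are assumptions chosen for the example.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align two residues
                       score[i - 1][j] + gap,        # gap in b
                       score[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# A short "domain" inside a longer sequence: good locally, poor globally.
print(alignment_score("GGGGACGTCCCC", "ACGT", local=True))
print(alignment_score("GGGGACGTCCCC", "ACGT", local=False))
```

The local score rewards the shared ACGT stretch only, while the global score is dragged down by the gaps needed to span the whole longer sequence, which is why local alignment suits sequences of different length.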
How do we score alignments?
ACCGGTATCCTAGGAC
|||  |||| ||||||
ACC--TATCTTAGGAC
• Assign a score for each match
• Assign a score (penalty) for each substitution
• Assign a score (penalty) for each insertion and deletion (gaps)
To score matches and substitutions we use substitution matrices (discussed later).
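As a minimal sketch of this scoring scheme, the alignment shown above can be scored column by column. The values match = 1, mismatch = -1 and gap = -2 are assumptions for illustration; real searches use a substitution matrix and usually separate gap-opening and gap-extension penalties.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap          # insertion or deletion
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


def percent_identity(top, bottom):
    """Identical columns as a percentage of the alignment length."""
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * identical / len(top)


top = "ACCGGTATCCTAGGAC"
bottom = "ACC--TATCTTAGGAC"
print(score_alignment(top, bottom))   # 13 matches, 1 substitution, 2 gap columns
print(percent_identity(top, bottom))
```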
Searching sequence databases
• A query sequence is searched against a database to look for homologs.
• The algorithm aligns your query to the sequences in the database and returns the highly similar ones.
• A scoring procedure is applied to the search to measure the degree of similarity.
• Whether the similar sequences are homologous to your query is a judgment that has to be made on the basis of scientific knowledge.
• There are 2 programs for this:
1. BLAST (Altschul et al. 1990)
2. FASTA (Pearson and Lipman 1988)
Basic Local Alignment Search Tool
• BLAST is a heuristic approach used to calculate similarity between biological sequences.
• It finds the best local alignments, as a fast approximation of the Smith-Waterman algorithm.
• BLAST displays results as a list of sequence matches (hits) ordered by statistical significance.
• BLAST is useful for identifying unknown sequences and for predicting gene or protein function.
Some BLAST terminology
Word – substring of a sequence
Word pair – pair of words of the same length
Query – sequence to be searched against the database
Hit – match to the query in the database
Score of a word pair – score of the gapless alignment of two words
VALMR
VAKNS
score = 4 + 2 + (-3) + (-2) + 0 = 1 (PAM250)
HSP – high-scoring segment pair
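The word-pair score above is simply the sum of the substitution scores of the aligned residues. A small sketch, using only the five PAM250 values that appear in the slide's arithmetic (the dictionary and function names are made up for this example):

```python
# PAM250 scores for the five residue pairs of the example,
# taken from the arithmetic on the slide (4 + 2 + -3 + -2 + 0).
PAM250_SUBSET = {
    ("V", "V"): 4,
    ("A", "A"): 2,
    ("L", "K"): -3,
    ("M", "N"): -2,
    ("R", "S"): 0,
}


def word_pair_score(word1, word2, matrix):
    """Score of the gapless alignment of two equal-length words."""
    if len(word1) != len(word2):
        raise ValueError("words in a word pair must have the same length")
    total = 0
    for a, b in zip(word1, word2):
        # Substitution matrices are symmetric, so accept either order.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


print(word_pair_score("VALMR", "VAKNS", PAM250_SUBSET))
```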
How does BLAST work?
• Parameters: w = word length; T = minimum score for a word hit (for proteins w = 3, T = 13 with BLOSUM62).
• Step 1: Given a query sequence Q, the sequence is split into words of defined length.
• Step 2: Scan the database for exact matches to the list of words compiled in step 1.
• Step 3: Pairwise alignment by extending the hits from step 2 in both directions.
• Step 4: Evaluate the significance of the extended hits from step 3.
• Step 5: Return sequences with HSPs that have statistically significant scores.
Step 1: Find high-scoring words
A nucleotide query is split into words of defined length:
GTAAAATCATCAT (w = 11)
GTAAAATCATC
TAAAATCATCA
AAAATCATCAT
A protein query is split into words. For every word x of length w in Q, make a list of words that score at least T (the threshold for hits) when aligned to x.
PDERTYHI (w = 3)
PDE DER ERT RTY TYH YHI
PDD DDR ERY RTF TYY YHI
PDN EDR ERF RFY TTH THV
Example: let x = PDE. The score for PDA = 5 + 5 + 0 = 10 (dropped) and for PDD = 5 + 5 + 4 = 14 (taken). The number of words depends on w and T.
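Splitting a query into overlapping words is a sliding window of width w. A minimal sketch (the function name is an assumption) that reproduces the word lists of the two examples above:

```python
def split_into_words(sequence, w):
    """Return every overlapping word of length w, from left to right."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(split_into_words("GTAAAATCATCAT", 11))   # nucleotide query, w = 11
print(split_into_words("PDERTYHI", 3))         # protein query, w = 3
```

A query of length L yields L - w + 1 words, so short word lengths give many more seeds to look up.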
Step 1: Find high-scoring words
Query: MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS
Query word – GSK, w = 3, words cut off at T = 11
Word  Score (BLOSUM62)
GSK   15
GAK   12
GNK   12
GTK   12
GSR   12
GDK   11
GQK   11
GEK   11
GGQ   11
(threshold score T = 11)
GKA   10
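The list of neighborhood words can be generated by scoring every possible word against the query word and keeping those that reach T. The sketch below is an illustration under assumptions: it hard-codes only the three BLOSUM62 rows needed for the query word GSK (transcribed by hand, so check them against an official copy of the matrix before relying on them), and the function names are invented for this example.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for G, S and K, in the residue order of AMINO_ACIDS.
# Assumption: transcribed by hand for this sketch; verify before reuse.
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
}


def word_score(query_word, candidate):
    """Gapless score of candidate aligned to query_word (letters G, S, K only)."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words scoring at least threshold against query_word, best first."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for word, score in neighborhood("GSK", 11):
    print(word, score)
```

It reproduces, for instance, GSK = 15 and GAK = 12 from the table above. Raising T shrinks the list and speeds up the search at the cost of sensitivity.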
Step 2: Finding hits
The database is searched for exact matches of the words.
GTAAAATCAAGTCCAGTATGACCTTCAAGTCCA
MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS
Query word – GSK, w = 3, T = 11
Query 1 MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD 40
        MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
Sbjct 1 MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE 40
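Exact word lookup is fast because the subject sequence can be indexed once by word. A minimal sketch, with invented function names; the word GSE is assumed here to be one of the neighborhood words of GSK, which is how the subject above can be hit even though it does not contain GSK itself.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map each word of length w to the 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_hits(query_words, subject, w):
    """Return (word, subject_position) for every exact word match."""
    index = index_words(subject, w)
    return [(word, pos) for word in query_words for pos in index.get(word, [])]


subject = "MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE"
# GSK is the query word; GSE is assumed to be in its neighborhood list.
print(find_hits(["GSK", "GSE"], subject, 3))
```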
Step 3: Extending hits
• Parameter: X (controlled by the user)
• Extend the hits by pairwise alignment in both directions.
• Extension continues until mismatches and gaps drop the score more than X below the best score reached so far; the extension is then terminated.
Query 1 MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD
        MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
Sbjct 1 MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE
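The extension step can be sketched as an ungapped "X-drop" walk in each direction from the seed word. This is a simplified illustration, not BLAST's actual implementation: it ignores gaps, uses assumed scores (match = 1, mismatch = -1) instead of a substitution matrix, and the function name and test sequences are made up.

```python
def extend_hit(query, subject, q_start, s_start, w, x_drop=3, match=1, mismatch=-1):
    """Ungapped X-drop extension of a word hit.

    Returns (score, query_from, query_to, subject_from); query_to is exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(pair(query[q_start + k], subject[s_start + k]) for k in range(w))

    def extend(step, q, s):
        """Walk one way; return (best gain, residues kept at that best)."""
        best = running = 0
        kept = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair(query[q], subject[s])
            length += 1
            if running > best:
                best, kept = running, length
            elif best - running > x_drop:
                break  # fell more than X below the best score so far
            q += step
            s += step
        return best, kept

    right_gain, right_len = extend(+1, q_start + w, s_start + w)
    left_gain, left_len = extend(-1, q_start - 1, s_start - 1)
    return (seed + left_gain + right_gain,
            q_start - left_len,
            q_start + w + right_len,
            s_start - left_len)


query = "GGGACGTACGTTTT"
subject = "CCCACGTACGTAAA"
score, q_from, q_to, s_from = extend_hit(query, subject, 5, 5, 3)
print(score, query[q_from:q_to])
```

A larger X lets the extension cross short runs of mismatches to reach further matches, at the cost of more computation.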
Step 4 & 5: Returning HSPs
• BLAST returns the highest-scoring segment pairs (HSPs).
• Is the score high enough to provide evidence of homology?
• Are the scores of alignments of random sequences higher than this score?
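These questions are answered with Karlin-Altschul statistics, which convert a raw score S into a bit score and an E-value: E = K·m·n·e^(-λS), where m and n are the query and database lengths. The sketch below is illustrative only; the values of lambda and K are assumptions (BLAST estimates them for the chosen scoring matrix and gap penalties, and uses length corrections that are omitted here).

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters (assumed, not taken from the slides).
LAMBDA, K = 0.318, 0.134
for s in (30, 50, 70):
    print(s, round(bit_score(s, LAMBDA, K), 1),
          e_value(s, query_len=100, db_len=1_000_000, lam=LAMBDA, k=K))
```

Because the E-value scales with the search space m·n, the same alignment is less significant in a larger database; the bit score removes the dependence on the scoring system so that E = m·n·2^(-S').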
Scoring system: nucleotides
Scoring systems: proteins
1. BLOSUM (BLOcks of amino acid SUbstitution Matrix)
• A BLOSUM matrix is derived from conserved, ungapped blocks of protein sequence alignments.
• BLOSUM matrices are calculated directly for different evolutionary distances:
• BLOSUM-45 is built from blocks in which sequences sharing at least 45% identity are clustered
• BLOSUM-80 is built from blocks in which sequences sharing at least 80% identity are clustered
• The higher the BLOSUM number, the more closely related the sequences should be
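Each entry of such a matrix is a log-odds score: how much more often two residues are seen aligned in the blocks than expected by chance. A minimal sketch with made-up frequencies; the function name is an assumption, and pair_freq is taken here to be the frequency of the ordered residue pair.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Scaled, rounded log2 odds ratio of observed vs. expected pairing.

    scale=2 gives half-bit units, the scale BLOSUM62 is published in.
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# Made-up frequencies, purely to show the sign of the score.
print(log_odds_score(0.0625, 0.125, 0.125))        # seen 4x more than chance
print(log_odds_score(0.015625, 0.125, 0.125))      # seen as often as chance
print(log_odds_score(0.00390625, 0.125, 0.125))    # seen 4x less than chance
```

Positive scores therefore mark substitutions that are tolerated in conserved regions, and negative scores mark substitutions that are rarely observed.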
Scoring systems: proteins
2. PAM (Point Accepted Mutation)
• Point accepted mutation matrices are based on the fact that amino acids of the same size, charge or hydrophobicity are likely to be substituted for each other.
• The PAM-1 matrix represents an average change of 1% (1 substitution in 100 residues).
• PAM-1 would be suitable for very closely related sequences.
• PAM-250 = 250 accepted point mutations per 100 residues (the same position can mutate more than once).
• Hence higher PAM matrices are used for sequences of greater evolutionary distance.
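Higher PAM matrices are obtained by multiplying the PAM-1 substitution probabilities by themselves, so PAM-250 corresponds to 250 successive PAM-1 steps. The sketch below uses a hypothetical two-letter alphabet, not the real 20 x 20 matrix, only to show that 250 mutations per 100 positions do not mean every position ends up different.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1" for a hypothetical two-letter alphabet:
# each letter has a 1% chance of changing in one step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0][0])   # probability that a position shows its original letter
```

Because positions can be hit repeatedly or revert, a substantial fraction of positions still shows the original residue even at large PAM distances.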
BLAST flavors
Program | Query | Database
blastn | nucleotide | nucleotide
blastp | protein | protein
blastx | nucleotide (translated) | protein
tblastn | protein | nucleotide (translated)
tblastx | nucleotide (translated) | nucleotide (translated)
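The choice of flavor follows mechanically from the type of the query and of the database. A small sketch of that decision; the function and argument names are invented for this example and are not part of any BLAST interface.

```python
def choose_blast_program(query_type, database_type, translate=False):
    """Pick a BLAST flavor from the query and database types.

    translate only matters for nucleotide vs. nucleotide searches, where it
    selects a protein-level comparison of translated sequences (tblastx).
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translate else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",    # query is translated
        ("protein", "nucleotide"): "tblastn",   # database is translated
    }
    if key not in programs:
        raise ValueError(f"unsupported combination: {key}")
    return programs[key]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```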
Graphical BLAST results
• This is a graphical view of the distribution of BLAST hits on the query sequence.
• The length of each hit shows the query coverage and the regions of similarity.
Figure labels: query sequence; BLAST hits (a mouse-over gives you the details, click to view the alignment).
Hit list
The BLAST result hit list gives the identity of sequences similar to your query sequence, ranked by similarity.
Figure labels:
• Sequence definition: click to view the pairwise alignment
• Bit score: values < 50 are unreliable
• % query coverage
• E-value
• Accession number: link to the record in GenBank
Pairwise alignment (protein)
Middle line of the alignment: a letter = identical residues; "+" = conserved substitution; blank = mismatch (non-conserved substitution).
Pairwise alignment (nucleotide)
BLAST result interpretation
• How do you reach a conclusion on homology?
• E-value = expected value: the number of hits with a score at least this high that you would expect to find by chance in a database of this size.
• The lower the E-value (the closer it is to 0), the more significant the hit. As a rule of thumb, an E-value below 10^-4 (0.0001) supports homology.
• % identity: the higher the identity, the greater the likelihood of homology (nucleotide >= 70%, protein >= 25%).
• Query coverage: a hit with both high query coverage and high similarity increases the chances of homology.
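These criteria can be applied automatically to BLAST+ tabular output (`-outfmt 6`, whose default columns are listed in FIELDS below). The sketch is an illustration only: the sample line is invented, the 50% coverage cut-off is an assumption not given in the slides, and coverage is computed from a single HSP rather than from all HSPs of a hit.

```python
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hit(line):
    """Turn one tab-separated result line into a dict with typed values."""
    hit = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for name in ("pident", "evalue", "bitscore"):
        hit[name] = float(hit[name])
    for name in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
        hit[name] = int(hit[name])
    return hit


def query_coverage(hit, query_length):
    """Percentage of the query covered by this single HSP."""
    return 100.0 * (abs(hit["qend"] - hit["qstart"]) + 1) / query_length


def looks_homologous(hit, query_length, max_evalue=1e-4,
                     min_identity=25.0, min_coverage=50.0):
    """Rule-of-thumb filter; min_coverage is an assumed cut-off."""
    return (hit["evalue"] < max_evalue
            and hit["pident"] >= min_identity
            and query_coverage(hit, query_length) >= min_coverage)


# Invented example line, for illustration only.
sample = "query1\tsubject1\t92.50\t200\t15\t0\t1\t200\t301\t500\t3e-85\t310"
hit = parse_hit(sample)
print(query_coverage(hit, query_length=250), looks_homologous(hit, query_length=250))
```

Passing such a filter is supporting evidence only; the final call on homology still rests on biological judgment.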
Summary for nucleotide sequences
Length | Database | Purpose | BLAST program
20 bp or longer | Nucleotide | Identify the query sequence | blastn, megablast
20 bp or longer | Nucleotide | Find similar nucleotide sequences | blastn
20 bp or longer | Nucleotide | Find proteins similar to the translated query in a translated nucleotide database | tblastx
20 bp or longer | Protein | Find proteins coded in my query DNA sequence | blastx
Summary for protein sequences
Length | Database | Purpose | BLAST program
15 residues or longer | Protein | Identify your query sequence or find protein sequences similar to it | blastp
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific scoring matrix (PSSM) | PSI-BLAST
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Conserved domains | Find conserved domains in your query and identify other proteins with similar domains | CD-search
15 residues or longer | Nucleotide | Find similar sequences in a translated nucleotide sequence database | tblastn
BLAST practical
The BLAST server is available at https://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST the unknown sequence below:
>unknown sequences
GGCATGAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCCACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACCCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCAGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATT
• What is the identity of the sequence?
• What are the % similarity, E-value and query coverage?
• Are the statistics indicative of homology?
• Try blastx on the sequence.
BLAST practical
BLAST the unknown protein sequence below (use blastp):
>unknown protein
MASSVISSAAVATRTNVTQAGSMIAPFTGLKSAATFPVSRKQNLDITSIASNGGRVRCMQVWPPINTKKYETLSYLPDLTDEQLLKEVEYLLKNGWVPCLEFETEHGIVYREKHKSPGYYDGRYWNMWKLPMFGCTDATQVLAEVQECKKSYPQAWIRIIGFDNVRQVQCISFIAYKPEGY
• Can you tell me what it is?
• Let's do a protein-level search against green plants (Viridiplantae).
• And a translated search against green plants using tblastn.
• Look at the conserved domains.
Further practice on BLAST
• https://digitalworldbiology.com/tutorial/blast-for-beginners
Thank you