Databases searched with BLAST · hpc.ilri.cgiar.org/beca/training/ilri_addis/Blast.pdf · 2017. 12. 14.

Databases searched with BLAST

Basic Bioinformatics Training 2017, ILRI, Addis Ababa, Ethiopia

Dec 11 – 15, 2017

Joyce Nzioki

Why database searches?

• Gene finding
• Assigning likely function to a gene
• Identifying regulatory elements
• Understanding genome evolution
• Finding relations between genes
• Assisting in genome assembly
• Identifying domains with sequence, structure and functional similarity

Sequence comparison

• Newly sequenced DNA data is compared to that already available in biological databases.

• Sequence comparison (of DNA or protein data) is achieved through alignment, the process by which regions of similarity are searched for between sequences.

• This eases annotation of new sequences, as biological knowledge from well-characterized homologs can be transferred to them.

Some terminology

1. Sequence similarity: two sequences are very alike in base-pair or amino-acid sequence. It is assessed by:
• Statistical measures such as E-value, P-value and bit score
• Percentage identity (% of identical residues between sequences)
• The length of the sequence stretch that is similar

2. Homology: homologs diverge from a common ancestor, and homology is inferred from sequence, structural and functional similarity.
• Orthologs – arise due to a speciation event
• Paralogs – arise due to gene duplication within a genome

Homology

1. Ortholog (orthologous genes): genes in different species that are similar to each other because they originate from a common ancestor; they usually keep the same function through evolution.

2. Paralog (paralogous genes): genes within a species that are similar to each other because they arose from a duplication event; they may evolve new functions over evolutionary time.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif

Sequence Alignment

What is it..?

ACCGGTATCCTAGGAC
ACCTATCTTAGGAC

Are these two sequences related? How similar or dissimilar are they?

ACCGGTATCCTAGGAC
|||  |||| ||||||
ACC--TATCTTAGGAC

Match the two sequences as closely as possible = aligned. Alignments need a score.

Sequence Alignment

There are two types of alignment strategies:

1. Global alignment: find the optimal alignment over the entire length of the two compared sequences.

• Best for highly similar sequences
• Used to search for mutations and polymorphisms in a sequence compared to a reference
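A minimal sketch of how a global alignment score can be computed by dynamic programming. This is the classic Needleman-Wunsch approach, which the slides do not name. The function name and the scores (+1 match, -1 mismatch, -1 per gap position) are assumptions chosen for illustration only.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning ALL of a against ALL of b (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diagonal,             # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[-1][-1]


if __name__ == "__main__":
    # The two example sequences from the alignment slide.
    print(global_alignment_score("ACCGGTATCCTAGGAC", "ACCTATCTTAGGAC"))
```

Because every residue of both sequences must be placed, the length difference between the two sequences is always paid for in gaps, which is why this strategy suits sequences of similar length and high similarity.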

Sequence Alignment

2. Local alignment: aligns short regions of similarity between sequences.

• Useful when looking for domains in proteins and for gene finding in DNA
• Best for sequences of low similarity and different length
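A minimal sketch of local alignment scoring in the style of the Smith-Waterman algorithm. It differs from global alignment in one respect: a cell may never drop below zero, so an alignment can start and stop anywhere. The function name, the test sequences and the +1/-1/-1 scores are assumptions for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best local score, (end_in_a, end_in_b)) using 1-based end positions."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            dp[i][j] = max(0, diagonal, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if dp[i][j] > best:
                best, best_pos = dp[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    # Two otherwise unrelated sequences sharing the short region ACGT.
    print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))
```

The flanking residues that do not match are simply left out of the alignment instead of being penalised, which is why this strategy suits sequences of low overall similarity or different length.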

How do we score alignments?

ACCGGTATCCTAGGAC
|||  |||| ||||||
ACC--TATCTTAGGAC

Assign a score for each match.
Assign a score (penalty) for each substitution.
Assign a score (penalty) for each insertion and deletion (gaps).
To score matches and substitutions we use substitution matrices (discussed later).
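The scoring rule above can be sketched for an alignment that has already been made. The score values (+1 match, -1 substitution, -2 per gap position) and the function name are assumptions, not values from the training. Percent identity is computed over the full alignment length, gaps included.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap.

    Returns (score, percent_identity).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    identical = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap          # insertion or deletion
        elif x == y:
            score += match        # identical residues
            identical += 1
        else:
            score += mismatch     # substitution
    return score, 100.0 * identical / len(row_a)


if __name__ == "__main__":
    # The alignment shown on the slide: 13 matches, 1 substitution, 2 gap positions.
    print(score_alignment("ACCGGTATCCTAGGAC", "ACC--TATCTTAGGAC"))
```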

Searching sequence databases

• A query sequence is searched against a database to look for homologs.

• The algorithm used aligns your query to the sequences in the database and returns highly similar sequences.

• A scoring procedure is implemented on searches to measure the degree of similarity.

• Judgment needs to be made on whether the similar sequences are homologous to your query, based on scientific knowledge.

• There are two programs for this:
1. BLAST (Altschul et al. 1990)
2. FASTA (Pearson and Lipman 1988)

Basic Local Alignment Search Tool

• BLAST is a heuristic approach used to calculate similarity between biological sequences.

• It finds high-scoring local alignments, approximating the Smith-Waterman algorithm rather than running it in full.

• BLAST displays results as a list of sequence matches (hits) ordered by statistical significance.

• BLAST is useful for identifying unknown sequences as well as for predicting gene or protein function.

Some BLAST terminology

Word – substring of a sequence
Word pair – pair of words of the same length
Query – sequence to be searched against the database
Hit – match to the query in the database
Score of a word pair – score of the gapless alignment of two words

VALMR
VAKNS
score = 4 + 2 + (-3) + (-2) + 0 = 1 (PAM250)

HSP – high-scoring segment pair

How does BLAST work?

• Parameters: w = word length; T = minimum score for a word hit (for proteins w = 3, T = 13 with BLOSUM62).

• Step 1: Given a query sequence Q, the sequence is split into words of defined length.

• Step 2: Scan the database for exact matches to the list of words compiled in step 1.

• Step 3: Build pairwise alignments by extending the hits from step 2 in both directions.

• Step 4: Evaluate the significance of the extended hits from step 3.

• Step 5: Return sequences with HSPs that have statistically significant scores.

Step 1: Find high scoring words

The nucleotide query is split into words of defined length.

GTAAAATCATCAT (w = 11)
GTAAAATCATC
 TAAAATCATCA
  AAAATCATCAT
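Splitting a query into overlapping words can be sketched in a few lines. The function name is an assumption; the inputs are the nucleotide and protein examples used on the Step 1 slides.

```python
def split_into_words(sequence, w):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


if __name__ == "__main__":
    for word in split_into_words("GTAAAATCATCAT", 11):
        print(word)
    print(split_into_words("PDERTYHI", 3))
```

A query of length L yields L - w + 1 words, so the 13-base example gives three words of length 11.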

The protein query is split into words. For every word x of length w in Q, make a list of words that, when aligned to x, score at least the threshold T.

PDERTYHI (w = 3)
PDE DER ERT RTY TYH YHI
PDD DDR ERY RTF TYY YHI
PDN EDR ERF RFY TTH THV

Example: let x = PDE; then the score for PDA = 5 + 5 + 0 = 10 (dropped) and for PDD = 5 + 5 + 4 = 14 (taken). The number of words depends on w and T.

Step 1: Find high scoring words

MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS

Query word – GSK, W = 3, words cut off at T = 11

Word  Score (BLOSUM-62)
GSK   15
GAK   12
GNK   12
GTK   12
GSR   12
GDK   11
GQK   11
GEK   11
GGQ   11
GKA   10

Threshold score T = 11
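A minimal sketch of building the neighbourhood of a query word: every possible word of the same length is scored against the query word and kept if it reaches T. To stay short, only the G, S and K rows of BLOSUM62 are embedded, so the sketch only works for query words made of those three letters. The rows were typed in by hand and should be checked against an authoritative copy of the matrix, so the words it reports may not match a hand-made list exactly. All names are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for G, S and K only (column order = AMINO_ACIDS).
# Assumption: hand-copied values; verify against a reference matrix.
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
}
SCORES = {row: dict(zip(AMINO_ACIDS, values)) for row, values in BLOSUM62_ROWS.items()}


def word_score(query_word, candidate):
    """Score of the gapless alignment of two words of equal length."""
    return sum(SCORES[q][c] for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words scoring at least `threshold` against query_word, best first."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    for word, score in neighborhood("GSK", 11):
        print(word, score)
```

Raising T shrinks the list and speeds up the search at the cost of sensitivity; lowering T does the opposite.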

Step 2: Finding hits

The database is searched for exact matches of words.

GTAAAATCAAGTCCAGTATGACCTTCAAGTCCA

MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS

Query word – GSK, W = 3, T = 11

Query  1  MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD  40
          MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
Sbjct  1  MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE  40
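A minimal sketch of the hit-finding step for a single database sequence. The subject is indexed by word once, then each query word is looked up for exact matches. The function names and the subject sequence are invented for illustration. For proteins, the neighbourhood words from step 1 would be looked up as well, which this sketch leaves out.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every word of length w in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def find_hits(query, subject, w):
    """Return (query_pos, subject_pos) for every exact word match (0-based)."""
    index = index_words(subject, w)
    hits = []
    for q_pos in range(len(query) - w + 1):
        for s_pos in index.get(query[q_pos:q_pos + w], []):
            hits.append((q_pos, s_pos))
    return hits


if __name__ == "__main__":
    # Hypothetical subject containing the first 11-base word of the query.
    print(find_hits("GTAAAATCATCAT", "CCGTAAAATCATCGG", 11))
```

Indexing by word is what makes the scan fast: each lookup is a dictionary access rather than a comparison against every position in the database.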

Step 3: Extending hits

• Parameter: X (controlled by the user)
• Extend the hits by pairwise alignment in both directions.

• Extension continues until mismatches and gaps drop the score more than X below the best score seen so far; the extension is then terminated.

Query  1  MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD
          MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
Sbjct  1  MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE
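A minimal sketch of ungapped hit extension with an X-drop rule. Starting from a word hit, the alignment grows to the right and to the left, and each direction stops once the running score falls more than X below the best score reached. The nucleotide scores (+1 match, -3 mismatch), X = 5, the function names and the test sequences are all assumptions. Real BLAST also performs gapped extension, which is not shown.

```python
def _extend(a, b, x_drop, match, mismatch):
    """Extend along a and b from their first characters.

    Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    for length, (x, y) in enumerate(zip(a, b), start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # dropped too far below the best score: stop extending
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, x_drop=5, match=1, mismatch=-3):
    """Extend a word hit of length w in both directions without gaps.

    Returns (score, q_start, q_end, s_start, s_end) with end-exclusive coordinates.
    """
    seed = sum(match if query[q_pos + k] == subject[s_pos + k] else mismatch
               for k in range(w))
    right, right_len = _extend(query[q_pos + w:], subject[s_pos + w:],
                               x_drop, match, mismatch)
    # Reverse the prefixes so the same routine can walk leftwards.
    left, left_len = _extend(query[:q_pos][::-1], subject[:s_pos][::-1],
                             x_drop, match, mismatch)
    return (seed + left + right,
            q_pos - left_len, q_pos + w + right_len,
            s_pos - left_len, s_pos + w + right_len)


if __name__ == "__main__":
    # Hypothetical pair sharing ACGTACG; the seed word GTAC starts at position 4 in both.
    print(extend_hit("TTACGTACGGAT", "CCACGTACGTTC", 4, 4, 4))
```

A larger X lets the extension cross longer runs of mismatches before giving up, at the cost of more computation.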

Step 4 & 5: Returning HSPs

• BLAST returns the highest-scoring segment pairs (HSPs).

• Is the score high enough to provide evidence of homology?

• Are the scores of alignments of random sequences higher than this score?

Scoring system: Nucleotides

Scoring systems: PROTEINS

1. BLOSUM (BLOcks of amino acid SUbstitution Matrix)

• The BLOSUM matrix is derived from conserved ungapped blocks of protein sequence alignments.

• BLOSUM matrices are calculated directly for different evolutionary distances:

• BLOSUM-45 represents sequences with about 45% identity
• BLOSUM-80 represents sequences with about 80% identity
• The higher the BLOSUM number, the more closely related the sequences should be

Scoring systems: PROTEINS

2. PAM (Point Accepted Mutation)

• PAM matrices reflect the fact that amino acids of similar size, charge or hydrophobicity are more likely to be substituted for each other.

• The PAM-1 matrix represents an average change of 1% (1 substitution in 100 residues).

• PAM-1 would be suitable for very closely related sequences.

• PAM-250 = 250 accepted mutations per 100 residues (a site can mutate more than once).
• Hence higher PAM matrices are used for sequences of greater evolutionary distance.

Some Flavors of BLAST

Query                    Database                 Program
Nucleotide               Nucleotide               blastn
Protein                  Protein                  blastp
Nucleotide (translated)  Protein                  blastx
Protein                  Nucleotide (translated)  tblastn
Nucleotide (translated)  Nucleotide (translated)  tblastx
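The choice of program follows mechanically from the type of the query and of the database, which can be sketched as a small lookup. The function and parameter names are assumptions; the program names are the five standard BLAST flavors.

```python
def choose_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST flavor from the sequence types ('nucleotide' or 'protein').

    compare_as_protein only matters for nucleotide vs nucleotide searches:
    it selects tblastx, which translates both sides before comparing.
    """
    table = {
        ("nucleotide", "nucleotide"): "tblastx" if compare_as_protein else "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",    # query is translated
        ("protein", "nucleotide"): "tblastn",   # database is translated
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'nucleotide' or 'protein'") from None


if __name__ == "__main__":
    print(choose_program("nucleotide", "protein"))
```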

Graphical BLAST results

• This is a graphical view of the distribution of BLAST hits on the query sequence.

• The length of the hits shows the query coverage and regions of similarity.

Query sequence

BLAST hits: a mouse-over gives you the details. Click to view the alignment.

Hit List

The BLAST result hit list gives the identity of sequences similar to your query sequence, ranked by similarity.

Bit score: values < 50 are unreliable

% Query coverage

E-value

Sequence definition: click to view the pairwise alignment

Accession number: link to the record in GenBank

Pairwise alignment (Protein)

A letter (e.g. "A"): identical residues. "+": conservative substitution (positive score in the matrix). Blank: mismatch (zero or negative score) or gap.

Pairwise alignment (nucleotide)

BLAST result interpretation

• How do you reach a conclusion on homology?

• E-value = Expect value: the number of hits with a score at least this high that would be expected by random chance in a database of this size.

• The lower the E-value (the closer it is to 0), the more significant the hit. As a rule of thumb, an E-value below 10^-4 (0.0001) supports homology.

• % identity: the higher the identity, the greater the likelihood of homology (nucleotide >= 70%, protein >= 25%).

• Query coverage: a hit with high query coverage and high similarity increases the chances of homology.
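The relation between bit score and E-value can be sketched with the standard formula E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The function names and example numbers are assumptions. BLAST itself uses effective (edge-corrected) lengths, so reported E-values will differ somewhat from this simplified calculation.

```python
import math


def evalue_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score (simplified)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_evalue(evalue):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    # Hypothetical search: 1,000-residue query against a database of 10^9 residues.
    for bits in (30, 50, 100):
        e = evalue_from_bit_score(bits, 1_000, 10**9)
        print(bits, e, p_value_from_evalue(e))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows, which is why E-values from searches against databases of different sizes are not directly comparable.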

Summary for nucleotide sequences

Query length: 20 bp or longer

Database    BLAST program      Purpose
Nucleotide  blastn, megablast  Identify the query sequence
Nucleotide  blastn             Find similar nucleotide sequences
Nucleotide  tblastx            Find proteins similar to the translated query in a translated nucleotide database
Protein     blastx             Find proteins coded in my query DNA sequence

Summary for protein sequences

Query length: 15 residues or longer

Database           BLAST program  Purpose
Protein            blastp         Identify your query sequence or find protein sequences similar to it
Protein            PSI-BLAST      Find members of a protein family or build a custom position-specific scoring matrix (PSSM)
Protein            PHI-BLAST      Find proteins similar to the query around a given pattern
Conserved domains  CD-search      Find conserved domains in your query and identify other proteins with similar domains
Nucleic            tblastn        Find similar sequences in a translated nucleotide sequence database

BLAST Practical

The BLAST server is available at https://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST the unknown sequence below:

>unknown sequences
GGCATGAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCCACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACCCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCAGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATT

• What is the identity of the sequence?
• What are the % similarity, E-value and query coverage?
• Are the statistics indicative of homology?
• Try blastx on the sequence.

BLAST Practical

BLAST the unknown protein sequence below (use BLASTP):

>unknown protein
MASSVISSAAVATRTNVTQAGSMIAPFTGLKSAATFPVSRKQNLDITSIASNGGRVRCMQVWPPINTKKYETLSYLPDLTDEQLLKEVEYLLKNGWVPCLEFETEHGIVYREKHKSPGYYDGRYWNMWKLPMFGCTDATQVLAEVQECKKSYPQAWIRIIGFDNVRQVQCISFIAYKPEGY

• Can you tell me what it is?
• Let's do a protein-level search against green plants (Viridiplantae).
• And a translated search against green plants using tblastn.
• Look at the conserved domains.

Further practice on BLAST

• https://digitalworldbiology.com/tutorial/blast-for-beginners

Thank you

[email protected]
