Databases searched with BLAST
Basic Bioinformatics Training 2017, ILRI, Addis Ababa, Ethiopia
Dec 11 – 15, 2017
Joyce Nzioki

Source: hpc.ilri.cgiar.org/beca/training/ilri_addis/Blast.pdf · 2017. 12. 14.

  • Why database searches?

    • Gene finding
    • Assigning likely function to a gene
    • Identifying regulatory elements
    • Understanding genome evolution
    • Finding relations between genes
    • Assisting in genome assembly
    • Identifying domains with sequence, structural and functional similarity

  • Sequence comparison

    • Newly sequenced DNA data is compared to data already available in biological databases.

    • Sequence comparison (of DNA or protein data) is achieved through alignment, the process by which regions of similarity are searched for between sequences.

    • This eases annotation of new sequences, because biological knowledge from well-characterized homologs can be transferred to them.

  • Some terminology

    1. Sequence similarity: two sequences are very alike in base pair or amino acid sequence. It is described by:
    • Statistical measures such as E-value, P-value and bit score
    • Percentage identity (% of identical residues between sequences)
    • The length of the sequence stretch that is similar

    2. Homology: homologs diverge from a common ancestor, and homology is inferred from sequence, structural and functional similarity.
    • Orthologs arise from a speciation event
    • Paralogs arise from gene duplication within a genome
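
    Percentage identity, one of the similarity measures listed above, can be computed directly from two aligned sequences. The sketch below is a minimal illustration and assumes that gap columns count as non-identical and that identity is taken over the full alignment length; the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns that hold the same residue.

    Assumption: both strings are already aligned (same length, '-' for gaps)
    and gap columns count as non-identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * identical / len(aligned_a)


# 13 identical columns out of 16 (1 substitution, 2 gap columns)
print(percent_identity("ACCGGTATCCTAGGAC", "ACC--TATCTTAGGAC"))  # 81.25
```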

  • Homology

    1. Ortholog (orthologous genes): genes in different species that are similar to each other because they originate from a common ancestor; they usually keep the same function throughout evolution.

    2. Paralog (paralogous genes): genes within a species that are similar to each other because they arose from a duplication event; they may evolve new functions over evolutionary time.

    http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif

  • Sequence Alignment

    What is it..?

    ACCGGTATCCTAGGAC
    ACCTATCTTAGGAC

    Are these two sequences related? How similar or dissimilar are they?

  • What is it..?

    ACCGGTATCCTAGGAC
    |||  |||| ||||||
    ACC--TATCTTAGGAC

    Match the two sequences as closely as possible = aligned. Alignments need a score.

  • Sequence Alignment

    There are two types of alignment strategies.

    1. Global alignment: find the optimal alignment over the entire length of the two compared sequences.

    • Best for highly similar sequences
    • Used to search for mutations and polymorphisms in a sequence compared to a reference
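
    A global alignment score can be computed with the Needleman-Wunsch dynamic programming algorithm, which the slides do not name but which is the standard method for this strategy. The sketch below returns only the optimal score; the match, mismatch and gap values are illustrative assumptions, not values from this training.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch: best score for aligning ALL of a against ALL of b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: the alignment must span both sequences.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal,                 # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACCGGTATCCTAGGAC", "ACCTATCTTAGGAC"))
```

    Because every position of both sequences must be accounted for, the length difference alone forces gap penalties, which is why this strategy suits sequences of similar length and high similarity.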

  • Sequence Alignment

    2. Local alignment: aligns short regions of similarity between sequences.

    • Useful when looking for domains in proteins and for gene finding in DNA

    • Best for sequences of low similarity and different lengths
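
    Local alignment is classically solved with the Smith-Waterman algorithm. The sketch below returns the best local score only; the scoring values are illustrative assumptions. The key difference from a global alignment is that a cell score never drops below zero, so an alignment can start and end anywhere.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman: best score of any pair of substrings of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            # The floor of 0 lets a new local alignment start at any cell.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The shared stretch "ACG" is found although the flanking regions differ.
print(local_alignment_score("TTTACGTTTT", "GGACGGG"))  # 6
```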

  • How do we score alignments

    ACCGGTATCCTAGGAC
    |||  |||| ||||||
    ACC--TATCTTAGGAC

    • Assign a score for each match
    • Assign a score (penalty) for each substitution
    • Assign a score (penalty) for each insertion and deletion (gaps)

    To score matches and substitutions we use substitution matrices (discussed later).
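
    The three scoring rules above can be applied column by column to an existing alignment. This is a minimal sketch with assumed values (match +1, substitution -1, gap -2 per position); real searches take match and substitution scores from a substitution matrix.

```python
def score_alignment(aligned_a: str, aligned_b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores of two already aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


# 13 matches, 1 substitution, 2 gap positions: 13 - 1 - 4 = 8
print(score_alignment("ACCGGTATCCTAGGAC", "ACC--TATCTTAGGAC"))  # 8
```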

  • Searching sequence database

    • A query sequence is searched against a database to look for homologs.

    • The algorithm aligns your query to the sequences in the database and returns highly similar sequences.

    • A scoring procedure is applied to the search results to measure the degree of similarity.

    • Judgment, based on scientific knowledge, is needed to decide whether the similar sequences are homologous to your query.

    • There are two programs for this:
      1. BLAST (Altschul et al. 1990)
      2. FASTA (Pearson and Lipman 1988)

  • Basic Local Alignment Search Tool

    • BLAST is a heuristic approach used to calculate similarity between biological sequences.

    • It finds high-scoring local alignments, approximating the Smith-Waterman algorithm rather than running it in full.

    • BLAST displays results as a list of sequence matches (hits) ordered by statistical significance.

    • BLAST is useful for identifying unknown sequences as well as for predicting gene or protein function.

  • Some BLAST terminology

    • Word: substring of a sequence
    • Word pair: pair of words of the same length
    • Query: sequence to be searched against the database
    • Hit: match to the query in the database
    • Score of a word pair: score of the gapless alignment of two words

    VALMR
    VAKNS
    score = 4 + 2 + (-3) + (-2) + 0 = 1 (PAM250)

    • HSP: high-scoring segment pair
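
    The word pair score above is a sum of substitution matrix entries, one per aligned position. The sketch below reproduces the VALMR / VAKNS example; the dictionary holds only the five PAM250 entries used in that example, and the names are hypothetical.

```python
# Only the PAM250 entries needed for the example word pair (not a full matrix).
PAM250_SUBSET = {
    ("V", "V"): 4,
    ("A", "A"): 2,
    ("L", "K"): -3,
    ("M", "N"): -2,
    ("R", "S"): 0,
}


def word_pair_score(word_a: str, word_b: str, matrix: dict) -> int:
    """Score of the gapless alignment of two words of equal length."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    total = 0
    for x, y in zip(word_a, word_b):
        # Substitution matrices are symmetric, so try both orientations.
        total += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
    return total


print(word_pair_score("VALMR", "VAKNS", PAM250_SUBSET))  # 1
```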

  • How does BLAST work

    • Parameters: w = word length (length of a hit); T = minimum score of a hit (for proteins w = 3, T = 13 with BLOSUM62).

    • Step 1: Given a query sequence Q, the sequence is split into words of defined length.

    • Step 2: Scan the database for exact matches to the list of words compiled in step 1.

    • Step 3: Build pairwise alignments by extending the hits from step 2 in both directions.

    • Step 4: Evaluate the significance of the extended hits from step 3.

    • Step 5: Return sequences with HSPs that have statistically significant scores.

  • Step 1: Find high scoring words

    The nucleotide query is split into words of defined length.

    GTAAAATCATCAT (w = 11)
    GTAAAATCATC
     TAAAATCATCA
      AAAATCATCAT

    The protein query is split into words. For every word x of length w in Q, make a list of words that score at least T (the threshold for hits) when aligned to x.

    PDERTYHI (w = 3)
    PDE DER ERT RTY TYH YHI
    PDD DDR ERY RTF TYY YHI
    PDN EDR ERF RFY TTH THV

    Example: let x = PDE; then the score for PDA = 5 + 5 + 0 = 10 (dropped) and for PDD = 5 + 5 + 4 = 14 (taken). The number of words depends on w and T.
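
    Splitting a query into overlapping words is a sliding window of width w. A minimal sketch, using the two queries from this slide (the function name is hypothetical):

```python
def split_into_words(query: str, w: int) -> list:
    """Return every overlapping substring of length w, in query order."""
    if w <= 0:
        raise ValueError("word length must be positive")
    return [query[i:i + w] for i in range(len(query) - w + 1)]


print(split_into_words("GTAAAATCATCAT", 11))
# ['GTAAAATCATC', 'TAAAATCATCA', 'AAAATCATCAT']
print(split_into_words("PDERTYHI", 3))
# ['PDE', 'DER', 'ERT', 'RTY', 'TYH', 'YHI']
```

    A query of length L yields L - w + 1 words, so shorter words produce more seeds and a more sensitive but slower search.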

  • Step 1: Find high scoring words

    MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS

    Query word: GSK, w = 3, words cut off at T = 11

    Word   Score (BLOSUM62)
    GSK    15
    GAK    12
    GNK    12
    GTK    12
    GSR    12
    GDK    11
    GQK    11
    GEK    11
    GGQ    11
    GKA    10

    Threshold score T = 11
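
    The list of words scoring at least T against a query word (its neighborhood) can be built by scoring every possible word of the same length. The sketch below does this for GSK with T = 11. Only the three BLOSUM62 rows needed for G, S and K are included; they were typed in by hand for this illustration and should be checked against a reference copy of the matrix before reuse.

```python
from itertools import product

# Column order for the rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues of the query word only (assumed values,
# hand-copied for this sketch).
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
}


def neighborhood(word: str, threshold: int) -> dict:
    """Map every word scoring >= threshold against `word` to its score."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
            for q, c in zip(word, candidate)
        )
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = neighborhood("GSK", 11)
for candidate, score in sorted(words.items(), key=lambda item: -item[1]):
    print(candidate, score)
```

    Raising T shrinks the list and speeds up the search at the cost of sensitivity; lowering it has the opposite effect.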

  • Step 2: Finding hits

    The database is searched for exact matches of words.

    GTAAAATCAAGTCCAGTATGACCTTCAAGTCCA

    MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRDIS

    Query word: GSK, w = 3, T = 11

    Query  1  MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD  40
              MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
    Sbjct  1  MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE  40
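
    Finding hits amounts to looking up each word of a database sequence in the word list from step 1. A minimal sketch, assuming the word list for GSK contains the neighbor GSE (it scores 11 with BLOSUM62); the function name and the three-word list are illustrative only.

```python
def find_hits(words, subject: str, w: int = 3) -> list:
    """Return (word, 0-based position) for each listed word found in subject."""
    wanted = set(words)
    return [
        (subject[i:i + w], i)
        for i in range(len(subject) - w + 1)
        if subject[i:i + w] in wanted
    ]


subject = "MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE"
print(find_hits({"GSK", "GSE", "GSR"}, subject))  # [('GSE', 14)]
```

    The subject does not contain GSK itself, so the hit is found only because the neighborhood word GSE is in the list.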

  • Step 3: Extending hits

    • Parameter: X (controlled by the user)
    • Extend the hits by pairwise alignment in both directions.

    • Extension continues until mismatches and gaps drop the score more than X below the best score seen so far; the extension is then terminated.

    Query  1  MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD
              MVRERKCILCHI++GS+KEMDEHMRSMLHHRELENLKGR+
    Sbjct  1  MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE
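
    The extension step can be sketched as an ungapped walk away from the seed in each direction that remembers the best score reached and stops once the running score falls more than X below it. The code assumes a simple +1 / -1 scoring scheme in place of a substitution matrix and does not handle gaps, so it is a simplification of what BLAST does; all names are hypothetical.

```python
def _extend(pair_scores, x_drop: int):
    """Return (best gain, number of positions kept) for one direction."""
    best = running = best_len = 0
    for n, s in enumerate(pair_scores, start=1):
        running += s
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score dropped more than X below the best: stop
    return best, best_len


def extend_hit(query: str, subject: str, q_pos: int, s_pos: int, w: int,
               x_drop: int, match: int = 1, mismatch: int = -1):
    """Extend a seed of length w; return (score, query start, query end).

    Positions are 0-based and the end is exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    seed = sum(pair(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right = (pair(a, b) for a, b in zip(query[q_pos + w:], subject[s_pos + w:]))
    left = (pair(a, b) for a, b in
            zip(reversed(query[:q_pos]), reversed(subject[:s_pos])))
    r_gain, r_len = _extend(right, x_drop)
    l_gain, l_len = _extend(left, x_drop)
    return seed + l_gain + r_gain, q_pos - l_len, q_pos + w + r_len


query = "MVRERKCILCHIVYGSKKEMDEHMRSMLHHRELENLKGRD"
subject = "MVRERKCILCHIIHGSEKEMDEHMRSMLHHRELENLKGRE"
print(extend_hit(query, subject, 14, 14, 3, x_drop=5))  # (33, 0, 39)
print(extend_hit(query, subject, 14, 14, 3, x_drop=1))  # (23, 14, 39)
```

    With X = 5 the extension crosses the two mismatches to the left of the seed and covers almost the whole query; with X = 1 it gives up on the left side, which shows how X trades sensitivity for speed.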

  • Step 4 & 5: Returning HSPs

    • BLAST returns the highest scoring segment pairs (HSPs).

    • Is the score high enough to provide evidence of homology?

    • Are the scores of alignments of random sequences higher than this score?
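
    These two questions are answered with Karlin-Altschul statistics: a raw score S is converted to a bit score S' = (lambda * S - ln K) / ln 2 and to an expected number of chance hits E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below applies those formulas; the lambda and K values, the score and the lengths are illustrative assumptions, since the real parameters depend on the scoring matrix and gap costs used.

```python
import math

# Assumed statistical parameters, for illustration only.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Normalise a raw alignment score so scores from different systems compare."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical search: 250-residue query against 1,000,000 database residues.
for raw in (40, 80, 120):
    print(raw, round(bit_score(raw), 1), e_value(raw, 250, 1_000_000))
```

    The E-value falls exponentially as the score rises and grows in proportion to the database size, so the same alignment is less significant in a larger database.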

  • Scoring system: Nucleotides

  • Scoring systems: Proteins

    1. BLOSUM (BLOcks SUbstitution Matrix)

    • A BLOSUM matrix is derived from conserved ungapped blocks of protein sequence alignments.

    • BLOSUM matrices are calculated directly for different evolutionary distances:

    • BLOSUM-45 is built from blocks clustered at 45% identity
    • BLOSUM-80 is built from blocks clustered at 80% identity
    • The higher the BLOSUM number, the more closely related the sequences should be

  • Scoring systems: Proteins

    2. PAM (Point Accepted Mutation)

    • PAM matrices build on the fact that amino acids of similar size, charge or hydrophobicity are likely to be substituted for each other.

    • The PAM-1 matrix represents an average change of 1% (1 substitution in 100 residues).

    • PAM-1 would be suitable for very closely related sequences.

    • PAM-250 = 250 accepted mutations per 100 residues (a site can mutate more than once).
    • Hence higher PAM matrices are used for sequences of greater evolutionary distance.

  • BLAST flavors

    Some flavors of BLAST (N = nucleotide, P = protein)

    Query             Database          Program
    N                 N                 blastn
    P                 P                 blastp
    N (translated)    P                 blastx
    P                 N (translated)    tblastn
    N (translated)    N (translated)    tblastx
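
    The choice of program follows from the query type, the database type and whether the comparison is made at the protein level through translation. A small lookup sketch (the function and dictionary names are hypothetical; the program names are the standard BLAST ones):

```python
BLAST_PROGRAMS = {
    # (query type, database type, translated search?) -> program
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(query_type: str, database_type: str,
                   translated: bool = False) -> str:
    """Pick the BLAST flavor for a query/database combination."""
    # Comparing nucleotide with protein always requires translation.
    if query_type != database_type:
        translated = True
    try:
        return BLAST_PROGRAMS[(query_type, database_type, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query vs {database_type} database"
        ) from None


print(choose_program("nucleotide", "protein"))                       # blastx
print(choose_program("nucleotide", "nucleotide", translated=True))   # tblastx
```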

  • Graphical BLAST results

    • This is a graphical view of the distribution of BLAST hits on the query sequence.

    • The length of each hit shows the query coverage and the regions of similarity.

    Figure labels: Query sequence; BLAST hits (a mouse-over gives you the details; click to view the alignment).

  • Hit list: BLAST result

    The hit list gives the identity of sequences similar to your query sequence, ranked by similarity.

    Figure labels:
    • Bit score: values < 50 are unreliable
    • % query coverage
    • E-value
    • Sequence definition: click to view the pairwise alignment
    • Accession number: link to the record in GenBank

  • Pairwise alignment (Protein)

    In the line between query and subject: a letter (for example "A") marks an identical residue; "+" marks a conserved substitution; a blank marks a non-conserved substitution or a gap.

  • Pairwise alignment (nucleotide)

  • BLAST result interpretation

    How do you make your conclusion on homology?

    • E-value = Expect value: the number of hits with a score at least this good that would be expected by random chance in a database of this size.

    • The lower the E-value (the closer it is to 0), the more significant the hit. To be confident of homology, the E-value should be below 10^-4 (0.0001).

    • % identity: the higher the identity, the greater the likelihood of homology (nucleotide >= 70%, protein >= 25%).

    • Query coverage: a hit with both high query coverage and high similarity increases the chances of homology.
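
    The criteria above can be combined into a simple screen over a hit list. The sketch below uses the thresholds stated in these slides (E-value below 10^-4, identity of at least 70% for nucleotides or 25% for proteins, bit score of at least 50). The two hits are made-up records for illustration, and since the slides give no cutoff for query coverage it is only reported, not tested.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    evalue: float
    percent_identity: float
    query_coverage: float
    bit_score: float


def supports_homology(hit: Hit, is_protein: bool,
                      max_evalue: float = 1e-4,
                      min_bit_score: float = 50.0) -> bool:
    """Apply the rule-of-thumb thresholds; a screen, not a proof of homology."""
    min_identity = 25.0 if is_protein else 70.0
    return (
        hit.evalue < max_evalue
        and hit.percent_identity >= min_identity
        and hit.bit_score >= min_bit_score
    )


# Hypothetical hits, not results of a real search.
hits = [
    Hit("hypothetical_1", 2e-50, 98.0, 100.0, 410.0),
    Hit("hypothetical_2", 0.5, 31.0, 20.0, 28.0),
]
for hit in sorted(hits, key=lambda h: h.evalue):
    verdict = "candidate homolog" if supports_homology(hit, is_protein=True) else "weak"
    print(hit.accession, hit.evalue, f"{hit.query_coverage}% coverage", verdict)
```

    Hits that pass still need the biological judgment mentioned earlier, for example checking that the aligned region covers the domain of interest.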

  • Summary for nucleotide sequences

    Query length: 20 bp or longer

    Database   | Purpose                                                                           | BLAST program
    Nucleotide | Identify the query sequence                                                       | blastn, megablast
    Nucleotide | Find similar nucleotide sequences                                                 | blastn
    Nucleotide | Find similar proteins to the translated query in a translated nucleotide database | tblastx
    Protein    | Find proteins coded in my query DNA sequence                                      | blastx

  • Summary for protein sequences

    Query length: 15 residues or longer

    Database          | Purpose                                                                                   | BLAST program
    Protein           | Identify your query sequence or find protein sequences similar to it                      | blastp
    Protein           | Find members of a protein family or build a custom position-specific scoring matrix (PSSM) | PSI-BLAST
    Protein           | Find proteins similar to the query around a given pattern                                 | PHI-BLAST
    Conserved domains | Find conserved domains in your query and identify other proteins with similar domains     | CD-search
    Nucleotide        | Find similar sequences in a translated nucleotide sequence database                       | tblastn

  • BLAST Practical

    The BLAST server is available at https://blast.ncbi.nlm.nih.gov/Blast.cgi

    Blast the unknown sequence below.

    >unknown sequences
    GGCATGAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCCACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACCCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCAGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATT

    • What is the identity of the sequence?
    • What are the % similarity, E-value and query coverage?
    • Are the statistics indicative of homology?
    • Try blastx on the sequence.
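
    blastx compares a nucleotide query with a protein database by first translating the query in all six reading frames (three per strand). The sketch below illustrates that translation step only, using the standard genetic code and a 33-nucleotide fragment copied from the practical sequence; it is not how the BLAST server is invoked, and the function names are hypothetical.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return the translation of each of the six reading frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


fragment = "ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCC"
for frame, protein in six_frames(fragment).items():
    print(frame, protein)
```

    Each of the six translations is then searched as a protein query, which is why blastx can find coding regions without knowing the strand or reading frame in advance.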

  • BLAST Practical

    Blast the unknown protein sequence below (use BLASTP).

    >unknown protein
    MASSVISSAAVATRTNVTQAGSMIAPFTGLKSAATFPVSRKQNLDITSIASNGGRVRCMQVWPPINTKKYETLSYLPDLTDEQLLKEVEYLLKNGWVPCLEFETEHGIVYREKHKSPGYYDGRYWNMWKLPMFGCTDATQVLAEVQECKKSYPQAWIRIIGFDNVRQVQCISFIAYKPEGY

    • Can you tell me what it is?
    • Let's do a protein level search against green plants (Viridiplantae)
    • And a translational search against green plants using tblastn
    • Look at the conserved domains

  • Further practice on BLAST

    • https://digitalworldbiology.com/tutorial/blast-for-beginners

  • Thank you
