Basic Sequence Analysis - Technical University of...

Basic Sequence Analysis

Lars Rønn Olsen, Assistant Professor, Technical University of Denmark

Learning objectives

After today, you will be able to:

• Understand how BLAST works

• Use BLAST for sequence similarity search

• Understand the theory behind de novo assembly of genomes from sequence reads

• Understand the theory behind short read alignment

• Understand how multiple sequence alignment works

• Use multiple sequence alignment to examine sequence variability

• Use public web services to predict vaccine targets in pathogens

Example: vaccine design workflow

Pathogen of interest: Dengue virus

General pathogen information: Wikipedia, PubMed

Species information: NCBI Taxonomy

Genomic sequence data: NCBI GenBank

Gene information: GeneCards, AmiGO

Genomic sequence data: Whole genome sequencing

Protein sequence data: NCBI Protein, SwissProt/UniProt

Gene expression profiles: NCBI GEO

Selection of vaccine targets

Legend: Database, Tool

BLAST - Searching databases for sequences

BLAST (Basic Local Alignment Search Tool) is a tool to query a database for sequences similar to an input sequence.

Imagine you have sequenced a gene from an unknown sample, and you would like to know what it is. You can use BLAST at NCBI to compare your sequence to ALL the 197,390,691 sequences in GenBank!

https://blast.ncbi.nlm.nih.gov/Blast.cgi

Example: you would like to know what the following sequence is:

X) AATGCCG

You have the following three sequences in your database:

A) CGTGTGATC
B) AATGCCG
C) GCTGTGAC

Example: you would like to know what the following sequence is:

X) AATGCCG

You have the following three sequences in your database:

A) CGTGTGATC
B) AATCCCG
C) GCTGTGAC

(Tip: use the font "Courier New")

X) AATGCCG
B) AATCCCG

Example: you would like to know what the following sequence is:

X) AATGCCG

You have the following three sequences in your database:

A) CGTGTGATC
B) AATCCCCG
C) GCTGTGAC

X) AATG-CCG
B) AATCCCCG

Example: you would like to know what the following sequence is:

X) AATGCCG

You have the following three sequences in your database:

A) AATCCCG
B) AATCCCCG
C) AATCC

X) AATGCCG
A) AATCCCG

X) AATG-CCG
B) AATCCCCG

X) AATGCCG
C) AAT-CC-

Which match is best?
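One way to make "which match is best?" concrete is to give every alignment a score. The sketch below scores the three alignments from the slide column by column. The scoring values (+1 for a match, -1 for a mismatch, -2 for a gap) are assumptions chosen for illustration, not BLAST's actual parameters, and real BLAST also uses a seed-and-extend heuristic and reports statistical significance, none of which is modelled here.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned, equal-length sequences column by column.

    The default scores are illustrative assumptions, not BLAST defaults.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # a gap in either sequence
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # different bases
    return total


# The three alignments of query X against database sequences A, B and C.
alignments = {
    "A": ("AATGCCG", "AATCCCG"),
    "B": ("AATG-CCG", "AATCCCCG"),
    "C": ("AATGCCG", "AAT-CC-"),
}

scores = {name: score_alignment(q, s) for name, (q, s) in alignments.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Under these assumed scores, one mismatch costs less than a gap, so A ranks above B, and C ranks last because it needs two gaps. Different scoring values can change the ranking, which is why the scoring scheme matters.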

Sequencing read alignment - Tools for next generation sequencing

Raw sequencing output

Base calls

Quality control/trimming

Sequencing without reference: De novo assembly

Sequencing with reference: Short read alignment

GenBank/RefSeq: Gene or genome sequence

Sequence Read Archive

Legend: Database, Tool

De novo assembly - Making sense of sequencing reads without a reference gene or genome

A) ATGACGTTT
B) TTTCTGAAA
C) AAATTCCCC
D) CCCCTGGCC

Reads placed by their overlaps, with the assembled sequence underneath:

A) ATGACGTTT
B)       TTTCTGAAA
C)             AAATTCCCC
D)                   CCCCTGGCC
   ATGACGTTTCTGAAATTCCCCCTGGCC
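The overlap idea above can be sketched in a few lines. This is a toy under stated assumptions, not how production assemblers work: it assumes error-free reads and that each read overlaps the next by exactly `k = 3` bases, which is the overlap the assembled sequence on the slide implies. The function name `assemble_chain` is made up for this example.

```python
def assemble_chain(reads, k=3):
    """Chain reads whose last k bases equal the next read's first k bases.

    Toy sketch: assumes error-free reads and one unambiguous chain.
    """
    suffixes = {r[-k:] for r in reads}
    # The first read is the one whose prefix is not the suffix of any read.
    starts = [r for r in reads if r[:k] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one possible start read")
    contig = starts[0]
    remaining = [r for r in reads if r != starts[0]]
    while remaining:
        # Reads that can extend the contig by a k-base overlap.
        candidates = [r for r in remaining if r[:k] == contig[-k:]]
        if len(candidates) != 1:
            raise ValueError("ambiguous or missing overlap")
        contig += candidates[0][k:]   # append only the non-overlapping part
        remaining.remove(candidates[0])
    return contig


# Reads from the slide, deliberately given out of order.
reads = ["CCCCTGGCC", "TTTCTGAAA", "ATGACGTTT", "AAATTCCCC"]
print(assemble_chain(reads))
```

The function raises an error as soon as more than one read could extend the sequence.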

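De novo assembly - Making sense of sequencing reads without a reference gene or genome

A) ATGACGCCC
B) CCCCTGCCC
C) CCCTTCCCC
D) CCCCTGGCC

Reads placed in the order A, B, C, D:

A) ATGACGCCC
B)       CCCCTGCCC
C)             CCCTTCCCC
D)                   CCCCTGGCC
   ATGACGCCCCTGCCCTTCCCCCTGGCC

Reads placed in the order A, C, B, D:

A) ATGACGCCC
C)       CCCTTCCCC
B)             CCCCTGCCC
D)                   CCCCTGGCC
   ATGACGCCCTTCCCCCTGCCCCTGGCC

Both orders satisfy the overlaps, because the repeated CCC makes reads B and C interchangeable. Which assembly is correct?

???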

An additional read does not necessarily settle the question:

E) TTCCCCCTG

Read E occurs in both candidate assemblies (order A, B, C, D and order A, C, B, D), so it cannot tell them apart.
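The ambiguity can be reproduced by brute force: try every ordering of the reads and keep those in which each read overlaps the next. As before, this is only a sketch that assumes error-free reads and a fixed overlap of `k = 3` bases. Real assemblers use overlap or de Bruijn graphs rather than enumerating permutations, which would not scale beyond a handful of reads.

```python
from itertools import permutations


def contigs_from_orderings(reads, k=3):
    """Return every distinct contig obtained by chaining all reads
    so that each read shares its first k bases with the previous read's last k."""
    contigs = set()
    for order in permutations(reads):
        if all(a[-k:] == b[:k] for a, b in zip(order, order[1:])):
            contigs.add(order[0] + "".join(r[k:] for r in order[1:]))
    return sorted(contigs)


reads = ["ATGACGCCC", "CCCCTGCCC", "CCCTTCCCC", "CCCCTGGCC"]  # reads A-D from the slide
contigs = contigs_from_orderings(reads)
for contig in contigs:
    print(contig)

# Read E from the slide is a substring of every candidate contig,
# so it does not help to choose between them.
read_e = "TTCCCCCTG"
print(all(read_e in contig for contig in contigs))
```

Two different contigs come out of the same four reads. A read would have to span the repeat together with unique sequence on both sides to resolve this, which is one reason longer reads make assembly easier.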

Short read alignment - Aligning reads to a reference gene or genome

Reference: ATGACGTCAGCTGTTGGCGACATCGTTCGATCAGTCGATTATTCGATAATCGCTCTCTTAG
Reads:     ATGACG       TTGGCG    CGTTCG   AGTCGA   TTCGAT      TCTCTT

Read depth: "x"
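A minimal sketch of what "aligning reads to a reference" and "read depth" mean, using the reference and reads from the slide. It places each read wherever it matches the reference exactly and counts how many reads cover each position. This is an assumption-heavy simplification: real short read aligners use indexed references, tolerate mismatches and gaps, and handle reads that match several places with mapping qualities.

```python
def find_all(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # also finds overlapping hits
    return positions


def read_depth(reference, reads):
    """Count, for each reference position, how many read placements cover it."""
    depth = [0] * len(reference)
    for read in reads:
        for pos in find_all(reference, read):
            for i in range(pos, pos + len(read)):
                depth[i] += 1
    return depth


reference = "ATGACGTCAGCTGTTGGCGACATCGTTCGATCAGTCGATTATTCGATAATCGCTCTCTTAG"
reads = ["ATGACG", "TTGGCG", "CGTTCG", "AGTCGA", "TTCGAT", "TCTCTT"]

for read in reads:
    print(read, find_all(reference, read))

depth = read_depth(reference, reads)
print("".join(str(d) for d in depth))
```

In this toy reference the read TTCGAT matches at two positions, a small-scale version of the multi-mapping problem that short reads run into in repetitive regions.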

Make your sequences available - Other researchers can benefit immensely from your work!

Your sequence reads can be deposited in the Sequence Read Archive.

https://www.ncbi.nlm.nih.gov/sra

Your assembled/mapped sequences can be deposited in GenBank.

Predicting T cell epitopes - Tools for vaccine target discovery

Protein sequences

Sequence variability analysis: Multiple sequence alignment

Prediction of epitopes: HLA allele, HLA allele frequency

Selection of epitopes

Immune Epitope Database

Epitopes for vaccine

Legend: Database, Tool

Multiple sequence alignment - Aligning sequences to determine variability

Multiple sequence alignment is a type of algorithm that allows you to compare two or more sequences simultaneously.

This is highly useful, for example when analyzing the variability of a certain protein in a pathogen.

There are many different algorithms for this purpose; one of the most famous is ClustalW, first published in 1994.

In fact, the ClustalW paper has been cited 53,288 times (number 10 in the paper mountain) and is still cited heavily.

HOWEVER! The main author of ClustalW, Des Higgins, has asked people to stop using and citing it, as there are upgrades and other, better tools available today!

How does it work? Very similar to BLAST, except all sequences are considered:

Unaligned:

AATCCCGA
AATCCCCGT
AATCCT

Aligned:

AATCCC-GA
AATCCCCGT
AATCC---T
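Once sequences are aligned, variability can be read off column by column. The sketch below takes the three aligned sequences from the slide and marks a column as conserved when all sequences carry the same base and none has a gap. That definition of "conserved" is an assumption made for this example; other measures, such as per-column entropy, are also common.

```python
def column_summary(alignment):
    """For each alignment column, return (column_string, is_conserved).

    A column counts as conserved here only if every sequence has the
    same residue and no sequence has a gap ('-').
    """
    summary = []
    for column in zip(*alignment):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        summary.append(("".join(column), conserved))
    return summary


aligned = ["AATCCC-GA", "AATCCCCGT", "AATCC---T"]  # aligned sequences from the slide

summary = column_summary(aligned)
conserved_positions = [i for i, (_, ok) in enumerate(summary) if ok]

for i, (column, ok) in enumerate(summary):
    print(i, column, "conserved" if ok else "variable")
```

The first five columns are conserved and the remaining four vary, either through a substitution or through gaps.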

In practice, you copy or download all your sequences of interest in FASTA format.

The FASTA format looks like this:

>This line is the header. You can write whatever you want here
ATCAGACTGTGCTGATCG…

You then paste or upload the sequences to a multiple sequence alignment web server (or install it locally if you have very large datasets) and run the alignment. One is webPRANK, which is available through EBI (European Bioinformatics Institute).

http://www.ebi.ac.uk/goldman-srv/webprank/
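The FASTA format described above is simple enough to read with a few lines of code. The parser below is a minimal sketch. The record names and the second sequence are made up for illustration, and for real work an established library is usually a safer choice than a hand-written parser.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}.

    Lines starting with '>' begin a new record; all following lines
    up to the next '>' are joined into one sequence.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            header = line[1:]         # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line   # sequences may span several lines
    return records


# Hypothetical example input; headers and the second sequence are invented.
example = """>seq1 hypothetical example header
ATCAGACTGTGC
TGATCG
>seq2 another hypothetical record
AATGCCG
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence), sequence)
```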

Prediction of T cell epitopes - What are T cell epitopes?

Virus proteins are cleaved to short peptides.

Some of these peptides bind to the human leukocyte antigen (HLA) protein.

When virus peptides are presented on the surface of the infected cell, T cells kill the infected cell.

The HLA protein comes in different flavors.

Finding T cell epitopes - Traditional approach to finding immunogenic regions in pathogens

Prediction of T cell epitopes - Computational approach to finding immunogenic regions in pathogens

Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol

Prediction of T cell epitopes - How to predict peptide binding to HLA

There are a lot of algorithms to predict epitopes in pathogens and cancer cells.

Among the best performing is NetMHC.

http://www.cbs.dtu.dk/services/NetMHC/

Protein sequence:

MRCVGVGNRDFVEGLSGATWVDVVLFQCLESIEGKAVQHENLKYTVIITVHTGDQHQVG

Overlapping 9-residue peptides from the start of the sequence, each classified as epitope or not epitope:

MRCVGVGNR
RCVGVGNRD
CVGVGNRDF
VGVGNRDFV
GVGNRDFVE
VGNRDFVEG
GNRDFVEGL
NRDFVEGLS
RDFVEGLSG
DFVEGLSGA
FVEGLSGAT
VEGLSGATW
EGLSGATWV
GLSGATWVD

Potential epitopes:

GVGNRDFVE
VGNRDFVEG
FVEGLSGAT
VEGLSGATW
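The first step shown above, cutting the protein into overlapping peptides, can be sketched as a sliding window. The code only generates the candidate 9-mers. It does not predict binding, which is what a predictor such as NetMHC is for. The function name `sliding_peptides` and the choice to use only the first 22 residues are assumptions for this illustration.

```python
def sliding_peptides(protein, length=9):
    """Return all overlapping peptides of the given length, one per start position."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# First 22 residues of the protein sequence shown on the slide.
protein_start = "MRCVGVGNRDFVEGLSGATWVD"
windows = sliding_peptides(protein_start)

# The four peptides the slide lists as potential epitopes.
potential_epitopes = ["GVGNRDFVE", "VGNRDFVEG", "FVEGLSGAT", "VEGLSGATW"]

for start, peptide in enumerate(windows, start=1):
    flag = "potential epitope" if peptide in potential_epitopes else ""
    print(start, peptide, flag)
```

A protein of n residues yields n - 8 nine-residue peptides, so the number of candidates to screen grows linearly with protein length.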

HLA databases - Databases with information about the human leukocyte antigen

HLA comes in different flavors (alleles).

Different alleles bind different peptides.

Each person carries up to six different classical HLA class I alleles (HLA-A, -B and -C; three from each parent).

Different HLA alleles are prevalent in different populations; see the HLA allele frequency database:

http://www.allelefrequencies.net/

You can also explore the sequences of the HLAs in the HLA allele database:

https://www.ebi.ac.uk/ipd/imgt/hla/

Selection of epitopes - Choosing HLA binders

Combining variability analysis (multiple sequence alignment) and HLA binding predictions lets you pick the best epitopes for your vaccine.

This can be done manually or using, for example, the BlockCons tool:

http://met-hilab.cbs.dtu.dk/blockcons/
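A rough sketch of the "manual" route: keep only those predicted binders that occur unchanged in every variant of the protein. This is not how BlockCons works internally. It is a simplified stand-in, and the second variant sequence below is hypothetical, invented only to show a substitution knocking out some candidates.

```python
def conserved_binders(binders, variants):
    """Keep predicted binders that occur unchanged in every variant sequence."""
    return [peptide for peptide in binders if all(peptide in v for v in variants)]


# Predicted binders: the potential epitopes listed earlier in these slides.
predicted_binders = ["GVGNRDFVE", "VGNRDFVEG", "FVEGLSGAT", "VEGLSGATW"]

variants = [
    "MRCVGVGNRDFVEGLSGATWVD",  # start of the protein sequence from the slides
    "MRCVGIGNRDFVEGLSGATWVD",  # HYPOTHETICAL variant with V -> I at position 6
]

selected = conserved_binders(predicted_binders, variants)
print(selected)
```

The two peptides that overlap the substituted position drop out, while the two that lie in the unchanged region remain.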

After you have selected your potential epitopes, you should check whether they are present in human proteins, as vaccination with these may lead to lack of efficacy or potentially autoimmunity.

How would you do this?

Answer: use BLAST against GenBank, and search for the peptides in human proteins.
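To illustrate what this check looks for, the sketch below does the simplest possible version: an exact substring search of each peptide against a small set of protein sequences. The "human" sequences here are hypothetical toy strings, not real human proteins. BLAST would search a real database and also report near-identical matches, which this sketch does not.

```python
def self_hits(peptides, human_proteins):
    """Return {peptide: [protein names]} for peptides found exactly in any protein."""
    hits = {}
    for peptide in peptides:
        matches = [name for name, seq in human_proteins.items() if peptide in seq]
        if matches:
            hits[peptide] = matches
    return hits


candidate_epitopes = ["GVGNRDFVE", "VGNRDFVEG", "FVEGLSGAT", "VEGLSGATW"]

# HYPOTHETICAL toy sequences standing in for a human protein database.
human_proteins = {
    "toy_protein_1": "MKTAYIAKQRFVEGLSGATQISFVKSHF",
    "toy_protein_2": "MAAAGGGSSSPEPTIDEKLLMNW",
}

hits = self_hits(candidate_epitopes, human_proteins)
safe = [p for p in candidate_epitopes if p not in hits]
print("found in human proteins:", hits)
print("no exact human match:", safe)
```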

Immune Epitope Database (IEDB) - Databases with epitopes in an array of pathogen species

The Immune Epitope Database contains epitopes that other researchers have reported in the literature.

Use it to see if anyone has worked experimentally with your predicted peptides before. If someone has already tested whether a peptide gives rise to an immune response, you can use this to inform your decision to use it or not.

http://www.iedb.org/

16 years of reverse vaccinology - What have we achieved?

Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol

1,779 reports of novel or improved prediction algorithms

4,622 reports of novel vaccine targets

142 active clinical trials of various forms of DNA vaccines

2 approved DNA vaccines ... protecting horses against West Nile virus and dogs against melanoma.

There are a lot of tools out there

Use Google to search for tools for your specific questions.

There are also online fora where you can ask questions if you cannot find the answer. For example BioStars:

https://www.biostars.org/

You can also explore tools in the tools registry at:

https://bio.tools/

Take home messages

• Exploring and re-analyzing published data is incredibly useful - when you sequence something, contribute to the research community and upload your raw and processed sequence data!
