Sequencing run grief counseling: counting kmers at MG-RAST

DESCRIPTION

Talk by Will Trimble of Argonne National Laboratory on April 29, 2014, at UIC's department of Ecology & Evolution on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.

Text of Sequencing run grief counseling: counting kmers at MG-RAST

  • 1. Sequencing run grief counseling: counting kmers at MG-RAST. Will Trimble, metagenomic annotation group, Argonne National Laboratory. April 29, 2014, UIC

2-3. Apology: I speak biology with an accent. I spent six years in dark rooms with lasers. Now I use computers to analyze high-throughput sequence data. I introduce myself as an applied mathematician. Finding scoring functions to use ambiguous data to answer life's persistent questions. Shoveling data from the data-producing machine into the data-consuming furnace.

4-5. Outline
  • Sequences are different (math)
  • Sequencing is like photography (pictures)
  • Sequencing is beautiful: thumbnailpolish (micrographs)
  • How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)

6-8. Sequences are different. Sequencing produces sequences. Sequences are qualitatively different from all other data types.
  • Low-throughput categorical data: categories are sound
  • Instrument readings, spectra, micrographs: not categorical
  • High-throughput sequence data: categories uncertain

    @HWI-ST1035:125:D1K4CACXX:8:1101:1168
    CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
    +
    @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
    @HWI-ST1035:125:D1K4CACXX:8:1101:1190
    CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
    +
    CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
    @HWI-ST1035:125:D1K4CACXX:8:1101:1339
    CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
    +
    BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

9.
Sequences are different. Sequencing produces sequences. Sequences are qualitatively different from all other data types.
  • Low-throughput categorical data: categories are sound
  • Instrument readings, spectra, micrographs: not categorical
  • High-throughput sequence data: categories uncertain
  • Scales shown: $10^{0}$ to $10^{2}$, $10^{2}$ to $10^{7}$, $10^{12}$ to $10^{80}$

    @HWI-ST1035:125:D1K4CACXX:8:1101:1168
    CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
    +
    @@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
    @HWI-ST1035:125:D1K4CACXX:8:1101:1190
    CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
    +
    CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
    @HWI-ST1035:125:D1K4CACXX:8:1101:1339
    CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
    +
    BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

10-11. Forward-backward problem. Experiment design; sequencing run; sequence data; assembly, annotation (SEED, M5NR). So we reduce sequence data to categorical data. Scales shown: $10^{12}$, $10^{3}$ to $10^{5}$, $10^{0}$ to $10^{1}$.

    489  Sensory box/GGDEF family
    470  hyphothetical protein
    241  Co-Zn-Cd resistance CzcA
    202  Transposase
    200  homocysteine methyltransferase (EC 2.1.1.13)
    175  cyclase/phosphodiesterase
    164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
    156  Methyl-accepting chemotaxis protein
    149  ABC transporter, ATP-binding protein
    147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
    133  Ferrous iron transport protein B

12. Sequences are different. Sequencing produces sequences. Sequences are qualitatively different from all other data types. Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

13. What is this sequence?

    >mystery_sequence
    CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
    GTCATCGATAGCAGGATAATAATACAGTA

  Who wrote this line?

    be regarded as unproved until it has been checked against more exact results

  Searching

14. What is this sequence?

    >mystery_sequence
    CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
    GTCATCGATAGCAGGATAATAATACAGTA

  Who wrote this line?

    be regarded as unproved until it has been checked against more exact results

  Searching. Same answer for both puzzles: you go to this website.

15. What is this sequence?

    >mystery_sequence
    CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
    GTCATCGATAGCAGGATAATAATACAGTA

  Who wrote this line?

    be regarded as unproved until it has been checked against more exact results

  Searching. How long do reads need to be to recognize them? How long do phrases need to be to recognize them?

16. How long do reads need to be? Information (Shannon, 1949, BSTJ) is a quantitative summary of the uncertainty of a probability distribution (a model of the data). Profound applicability in machine learning and probabilistic modeling.

    $H = \sum_i p_i \log_2 \frac{1}{p_i}$

17. How long do phrases need to be? Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.

    for n in 1..10
      type the first n words into Google Books, quoted.
      break if Google identifies your book.

18. How long do phrases need to be? Information content of English words: $H_{word}$ is ca. 12 bits per word. Size of Google Books? Big libraries have a few $10^{7}$ books, each one has $10^{5}$ indexed words, so a database size of $10^{12}$ words.

    $\log_2(\text{database size}) = \log_2 10^{12} = \log_2 2^{39.9} = 39.9$, about 40 bits

  So we expect on average 40/12 = 3.3, rounded up to 4 words, to be enough to find a phrase in Google's index. Try it.

19-20. How long do phrases need to be? Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.

    for n in 1..10
      type the first n words into Google Books, quoted.
      break if Google identifies your book.

  Usually nails your source in four words.

21.
How long do reads need to be? Maximum information content of base pairs: $H_{read} \le 2\ell$ bits per length-$\ell$ sequence. Most long kmers are distinct: genome of size $G$ (ca. $10^{10}$ bp).

    $\log_2 G = \log_2 10^{10} = \log_2 2^{33.2} = 33.2$, rounded up to 34 bits

  So we expect that when $2\ell > 34$ bits, we should be able to place any sequence. That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.

22. How long do reads need to be? Maximum information content of base pairs: $H_{read} \le 2\ell$ bits per length-$\ell$ sequence. Most long kmers are distinct: genome of size $G$ (ca. $10^{10}$ bp).

    $\log_2 G = \log_2 10^{10} = \log_2 2^{33.2} = 33.2$, rounded up to 34 bits

  So we expect that when $2\ell > 34$ bits, we should be able to place any sequence. That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome. Short sequences end up being very distinctive, even fingerprint-like.

23. Check: Human reference genome

24. The data deluge. There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once. The data has outgrown some favorite algorithms from the 1990s (BLAST).

25. http://www.mcs.anl.gov/~trimble/flowcell/ thumbnailpolish

26-27. Rarefaction of a photograph. A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of sequences that land in each possible sequence. I actually think of a sequencer like a multichannel genetic spectrometer.

28. The genetic spectrometer. With my $10^{12}$-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
  • Species diversity

    ATCGCGAAAAGTCCC    2
    AAAAAAAAAAAAAAA  459
    AAAAAAAAAAAAAAC   71
    AAAATAAAAAAAATA    1
    AAAAAAAAAAAAAAG   36
    ACATGAAAAACAACT    1
    AAAAAAAAAAAAAAT   23
    AAAAAAAAAAAAACA   95
    GTAGGAAAAGCCCAC    1
    AAAAAAAAAAAAACC    7
    AAAAAAAAAAAAACG    8
    AAAAAAAAAAAAACT    9
    AAAAAAAAAAAAAGA   36
    AACAAGAAAAACAAA    1
    AAAAAAAAAAAAAGC   10
    AAATAAAAAAAATAG    1
    AACAGAAAAAACACG    1
    AAAAAAAAAAAAAGG    2
    AAAAAAAAAAAAAGT    6

29.
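The two back-of-envelope estimates in the slides above (four words to pin down a phrase, 17 base pairs to place a read) and the kmer count table on the genetic spectrometer slide can be reproduced in a few lines of Python. This is a minimal sketch under stated assumptions, not the code behind kmerspectrumanalyzer or nonpareil-k: the function names `symbols_needed`, `count_kmers`, and `abundance_spectrum` are hypothetical, the toy reads are invented, and kmers are counted on one strand only, without merging reverse complements.

```python
import math
from collections import Counter


def symbols_needed(database_size, bits_per_symbol):
    """Smallest whole number of symbols carrying at least log2(database_size) bits."""
    return math.ceil(math.log2(database_size) / bits_per_symbol)


def count_kmers(reads, k):
    """Count every length-k substring of every read (strands are not merged)."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def abundance_spectrum(kmer_counts):
    """Map each abundance to the number of distinct kmers seen that many times."""
    return Counter(kmer_counts.values())


# Figures quoted on the slides: 10^12 indexed words at about 12 bits per word,
# and a genome of about 10^10 bp at no more than 2 bits per base.
words_needed = symbols_needed(10**12, bits_per_symbol=12)
bases_needed = symbols_needed(10**10, bits_per_symbol=2)
print("words needed:", words_needed)
print("bases needed:", bases_needed)

# Toy reads, invented for illustration only (assumption, not data from the talk).
reads = ["ACGTACGT", "CGTACGTA", "TTTTTTTT"]
kmer_counts = count_kmers(reads, k=4)
for kmer, n in sorted(kmer_counts.items()):
    print(kmer, n)
print("spectrum:", dict(abundance_spectrum(kmer_counts)))
```

With the figures quoted on the slides, `symbols_needed` returns 4 words and 17 bases. Holding every kmer in a plain `Counter` is fine for a toy example, but it is an assumption of this sketch that memory is not a concern; it would likely not hold for tens of gigabytes of reads.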
The genetic spectrometer. With my $10^{12}$-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
  • Species diversity
  • Gene diversity

    ATCGCGAAAAGTCCC    2
    AAAAAAAAAAAAAAA  459
    AAAAAAAAAAAAAAC   71
    AAAATAAAAAAAATA    1
    AAAAAAAAAAAAAAG   36
    ACATGAAAAACAACT    1
    AAAAAAAAAAAAAAT   23
    AAAAAAAAAAAAACA   95
    GTAGGAAAAGCCCAC    1
    AAAAAAAAAAAAACC    7
    AAAAAAAAAAAAACG    8
    AAAAAAAAAAAAACT    9
    AAAAAAAAAAAAAGA   36
    AACAAGAAAAACAAA    1
    AAAAAAAAAAAAAGC   10
    AAATAAAAAAAATAG    1
    AACAGAAAAAACACG    1
    AAAAAAAAAAAAAGG    2
    AAAAAAAAAAAAAGT    6

30. The genetic spectrometer. With my $10^{12}$-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
  • Species diversity
  • Gene diversity
  • Sequence diversity

    ATCGCGAAAAGTCCC    2
    AAAAAAAAAAAAAAA  459
    AAAAAAAAAAAAAAC   71
    AAAATAAAAAAAATA    1
    AAAAAAAAAAAAAAG   36
    ACATGAAAAACAACT    1
    AAAAAAAAAAAAAAT   23
    AAAAAAAAAAAAACA   95
    GTAGGAAAAGCCCAC    1
    AAAAAAAAAAAAACC    7
    AAAAAAAAAAAAACG    8
    AAAAAAAAAAAAACT    9
    AAAAAAAAAAAAAGA   36
    AACAAGAAAAACAAA    1
    AAAAAAAAAAAAAGC   10
    AAATAAAAAAAATAG    1
    AACAGAAAAAACACG    1
    AAAAAAAAAAAAAGG    2
    AAAAAAAAAAAAAGT    6

31. Rarefaction of a photograph. Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.

32. Rarefaction of a photograph. Some parts seem to be dark.

33. Rarefaction of a photograph

34. Rarefaction of a photograph. This looks like a portrait.

35. Rarefaction of a photograph

36. Rarefaction of a photograph. Start to see the mood.

37. Rarefaction of a photograph

38. Rarefaction of a photograph. A tiny bit of graininess left.

39. Rarefaction of a photograph. Shot noise in electrical engineering.

40. Rarefaction of a photograph. A studio portrait of Jane Goodall.

41. A scientific image. This is a famous scientific image. Anybody recognize it?

42. A scientific image. Does this help?

43. A scientific image. There are small patches o