Assembling diverse & rich metagenomes: the secrets of the ancients. C. Titus Brown [email protected]

2014 marine-microbes-grc

1. Assembling diverse & rich metagenomes: the secrets of the ancients. C. Titus Brown [email protected]
2. Introducing myself: ged.msu.edu/. Data-intensive biology tools, etc. Not a marine microbiologist at all! Note: these slides are all on SlideShare (Google "titus brown slide share").
3. My goals: Enable hypothesis-driven biology through better hypothesis generation & refinement. Devalue the interest level of sequence analysis and put myself out of a job. Be a good mutualist!
4. Part I: Soil Assembly & the Great Prairie Grand Challenge 2008
5. Soil microbial ecology questions: What ecosystem-level functions are present, and how do microbes do them? How does agricultural soil differ from native soil? How does soil respond to climate perturbation? Questions that are not easy to answer without shotgun sequencing: What kind of strain-level heterogeneity is present in the population? What does the phage and viral population look like? What species are where?
6. A Grand Challenge dataset (DOE/JGI). [Bar chart: base pairs of sequencing (Gbp), axis 0-600, GAII and HiSeq, for eight sites: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass. Also labeled: rumen (Hess et al., 2011), 268 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp; rumen k-mer filtered, 111 Gbp.] Total: 1,846 Gbp soil metagenome. (Adina Howe)
7. Approach: assemble into contigs. We found that short reads from phylogenetically distant and microbially diverse environments could not be reliably annotated. => Build into longer contigs first. (A 5-year odyssey.)
8. (Friends don't let friends BLAST short reads.**) ** Applicable to most environmental samples. Howe et al., 2014
9. Developed two new methods: I. Computational cell sorting. II. Computational library normalization. See: Pell et al., Tiedje, Brown (2012); Howe et al., Tiedje, Brown (2014); Goffredi et al. (2014)
10-15. Digital normalization (slides 10 through 15 carry only this title).
16. Putting it in perspective: Total equivalent of ~1200 bacterial genomes. Human genome: ~3 billion bp. Result: we (easily, casually) assembled two of the biggest metagenomes ever. Howe et al., 2014; pmid 24632729. (I'll come back to this.)
    Total assembly | Total contigs (> 300 bp) | % reads assembled | Predicted protein coding
    2.5 billion    | 4.5 million              | 19%               | 5.3 million
    3.5 billion    | 5.9 million              | 22%               | 6.8 million
17. So... We can now achieve an assembly of pretty much anything (soil was really hard; virtually everything else is easier!). Lots of people are interested in collaborating with us on this, but we regard it as a largely solved problem.
18. I: assembly protocols. khmer-protocols: an open, versioned, citable, forkable set of instructions to assemble eukaryotic mRNAseq and metagenomes on widely accessible compute resources. Explicit command-line instructions to go from raw reads to annotated final product. For mRNAseq: ~$150 of compute for $2000 of data. (Still in beta, note.)
19. khmer-protocols: read cleaning, preprocessing, assembly, annotation.
20. Example: Deep Carbon data set. Masimong Gold Mine; microbial cells filtered from fracture water from within a 1.9 km borehole (32,000-year-old water). 5.6 million reads, 601.3 Mbp; the computational protocol took 4 hours; assembled to 56 Mbp in contigs > 300 bp; the longest contig is 73 kb; 70% of paired-end reads mapped. (With M.C.Y. Lau and Tullis Onstott.)
21. Our (open) approach: If the protocols work for you, great! Cite us. If the protocols don't work for you, please let us know so we can fix them. If it's a challenging problem, we'd love to collaborate. We are also happy to help train people.
22. Things we no longer worry about (much; let's chat). Inter-species assembly chimerae: apart from within-strain variants, chimerae are hard to form with contig assembly. Finding homology matches in metagenomes: contigs give as good a match as possible. Assembling contigs when we have sufficient coverage: not enough coverage is usually the problem.
23. II: Shotgun sequencing and coverage. Coverage is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
24. Random sampling => deep sampling needed. Typically 10-100x coverage is needed for robust recovery (300 Gbp for human).
25. Assembly depends on high coverage. (HMP mock community)
26. Downstream goals of assembly (even assuming ribotyping works perfectly): Annotate genes with higher confidence. Reconstruct operons & ultimately even full genomes. Analyze strain variation. Study organisms that ribotyping can't (phage & virus).
27. Main questions: I. How do we know if we've sequenced enough? II. Can we predict how much more we need to sequence to see ? Note: necessary sequencing depth cannot accurately be predicted from SSU/amplicon data.
28. Method 1: looking for WGS saturation. We can track how many sequences we keep of the sequences we've seen, to detect saturation.
29. We can detect saturation of shotgun sequencing. Data from Shakya et al., 2013 (pmid: 23387867).
30. We can detect saturation of shotgun sequencing. C=10, for assembly. Data from Shakya et al., 2013 (pmid: 23387867).
31. Estimating metagenome nt richness: # bp at saturation / coverage. MM5 deep carbon: 60 Mbp. Iowa prairie soil: 12 Gbp. Amazon Rain Forest Microbial Observatory soil: 26 Gbp. Assumes: few entirely erroneous reads (upper bound); at saturation (lower bound).
32. WGS saturation approach: Tells us when we have enough sequence. Can't be predictive: if you haven't sampled something, you can't say anything about it. Can we correlate deep amplicon sequencing with shallower WGS?
33. Correlating 16S and shotgun seq. Errors do not strongly affect saturation. How much of 16S do you see with how much shotgun sequencing?
34. WGS saturation ~matches 16S saturation. < rRNA copy number > Data from Shakya et al., 2013 (pmid: 23387867).
35. 16S region choice is not significant (?!). Data from Shakya et al., 2013 (pmid: 23387867).
36. Method is robust to organisms unsampled by amplicons. Insensitive to amplicon primer bias. Robust to genome size differences, eukaryotes, phage. Data from Shakya et al., 2013 (pmid: 23387867).
37. Can examine specific OTUs. Data from Shakya et al., 2013 (pmid: 23387867).
38. OTU abundance is ~correct. Data from Shakya et al., 2013 (pmid: 23387867).
39-40. Running on real communities (slides 39 and 40 carry only this title).
41. Thoughts on 16S/WGS comparison: Robust to some real problems (primer bias; organisms unsampled by amplicon seq) & insensitive to 16S seq error. Hopefully can be used to build a predictive framework to answer "how much more sequencing should I do?" Sensitivity: What have I missed? Planning: How much $$ should I ask for?
42. Other things that y'all might be interested in: Comparing 16S from amplicon and shotgun sequencing. Metatranscriptome assembly protocol. Biogeography of genomic sequence.
43. Metatranscriptome assembly (soil). Aaron Garoutte (w/Tiedje & Howe)
    Sample           | Total length (bp) | Total rRNA (bp)        | Total annotated by MG-RAST (m5nr, SEED)
    Unassembled MetaT | 20,525,296,600    | 16,987,863,800 (82.8%) | 48,080,200 (0.23%)
    Assembled MetaT   | 32,471,548        | 7,061,913 (21.8%)      | 2,075,701 (6.4%)
44. Using shotgun sequence to cross-validate amplicon predictions. [Bar chart, axis 0-40%: AMP/RDP, AMP/SILVA, WGS/RDP, WGS/SILVA, WGS/SILVA (LSU).] Amplicon seq is missing Verrucomicrobia. (Jaron Guo)
45. Primer bias against Verrucomicrobia. Check taxonomy of reads causing mismatch (A): Verrucomicrobia cause 70% (117/168) of mismatches. The current primer is not effective at amplifying Verrucomicrobia. (Jaron Guo)
46. Biogeography of genomic DNA: How much genomic DNA is shared between different sites? (Qingpeng Zhang)
47. Biogeography of genomic DNA (2): How much genomic richness is shared between different sites? (Qingpeng Zhang)
48. Concluding thoughts: Tools and protocols for data analysis are fast becoming intrinsic to the practice of biology. Most tools are wrong, but some are useful. All of our tools are openly, freely available in every way possible. We are trying to make assembly fast, cheap, easy, and good. We are building on our assembly-based approaches & intuition to tackle other questions.
49. Big Data is neither the real problem nor the solution. Dealing with Big Data requires a new mentality, so training/experience is probably the most effective way forward. With sequencing, few if any of your biology problems go away, although some aspects may become more tractable. Think future: any -ome you want from any sample you can get. So now...
50. Putting it in perspective: Total equivalent of ~1200 bacterial genomes. Human genome: ~3 billion bp. We don't know what most genes do. Howe et al., 2014; pmid 24632729.
    Total assembly | Total contigs (> 300 bp) | % reads assembled | Predicted protein coding
    2.5 billion    | 4.5 million              | 19%               | 5.3 million
    3.5 billion    | 5.9 million              | 22%               | 6.8 million
51. Potential discussion topics: A. Funding and collaboration models. B. Leveraging data & computation to help understand gene function. C. Computational/data infrastructure, but planning for poverty, not wealth: sustainability and bus factor. D. Capacity building: standardized data sets; data availability; workshops and training.
52. Training in data analysis et al.: Software Carpentry. Data Carpentry. STAMPS, EDAMAME, MSU NGS course.
53. Potential discussion topics (repeat of slide 51).
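
Worked example for slides 23, 24 and 31: the coverage arithmetic can be written out in a few lines of Python. This is only an illustrative sketch, not code from the talk or from khmer. The function names are hypothetical, the genome size is the ~3 billion bp quoted on slide 16, and the 50 Gbp passed to the richness function is a made-up input, not a measurement from any of the data sets.

```python
def required_sequencing_bp(genome_size_bp, coverage):
    """Bases to sequence so that each base is covered `coverage` times on average."""
    return genome_size_bp * coverage


def average_coverage(total_read_bp, genome_size_bp):
    """Average number of reads overlapping each base (slide 23's definition)."""
    return total_read_bp / genome_size_bp


def richness_estimate_bp(bp_kept_at_saturation, coverage_cutoff):
    """Slide 31's estimate: bp retained at saturation divided by the coverage cutoff."""
    return bp_kept_at_saturation / coverage_cutoff


HUMAN_GENOME_BP = 3e9  # ~3 billion bp, as quoted on slide 16

low = required_sequencing_bp(HUMAN_GENOME_BP, 10)
high = required_sequencing_bp(HUMAN_GENOME_BP, 100)
print(f"10x: {low / 1e9:.0f} Gbp, 100x: {high / 1e9:.0f} Gbp")
# prints "10x: 30 Gbp, 100x: 300 Gbp"

# Hypothetical input: 50 Gbp retained at saturation with a cutoff of C=10.
print(f"{richness_estimate_bp(50e9, 10) / 1e9:.0f} Gbp")
# prints "5 Gbp"
```

The 100x end of the range reproduces the "300 Gbp for human" figure on slide 24. For a metagenome the same division runs the other way, because genome size (richness) is the unknown being estimated.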
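
Sketch for slides 9-15 and 28: the slides name digital normalization and describe tracking how many reads are kept, but do not spell out the procedure. The code below assumes the median k-mer abundance rule behind khmer's normalize-by-median script: a read is kept only if the median count of its k-mers, among the reads kept so far, is below a cutoff C. Exact counting in a Python Counter is swapped in for khmer's memory-efficient probabilistic counting, and the function names, the tiny k, the window size, and the two toy reads are all hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=3, window=3):
    """Keep a read only while the median count of its k-mers is below cutoff.

    Returns the kept reads and the fraction of reads kept in each
    consecutive window; a keep rate near zero suggests saturation.
    """
    counts = Counter()   # exact k-mer counts (assumption: khmer uses a sketch)
    kept = []
    keep_rate = []
    kept_in_window = 0
    for n, read in enumerate(reads, start=1):
        ks = kmers(read, k)
        if ks and median(counts[x] for x in ks) < cutoff:
            counts.update(ks)        # only kept reads contribute counts
            kept.append(read)
            kept_in_window += 1
        if n % window == 0:
            keep_rate.append(kept_in_window / window)
            kept_in_window = 0
    return kept, keep_rate


# Toy input: six copies of one read, then six copies of an unrelated read.
read_a = "ACGTTGCA"
read_b = "TTAGGCAT"
kept, keep_rate = normalize([read_a] * 6 + [read_b] * 6)
print(len(kept), keep_rate)
# prints "6 [1.0, 0.0, 1.0, 0.0]"
```

In the toy run the keep rate falls to zero once a sequence has been seen C times and rises again when new sequence appears, which is the signal slide 28 describes. Real reads arrive in mixed order and contain errors, so an actual curve would decline gradually and would not step like this.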