1
Ray and Ray Cloud Browser for Metagenomics
Sébastien Boisvert @sebhtml
Université Laval, Québec, Canada
Beatles and Bioinformatics! #BeatlesAndBioinformatics University of Liverpool
27th November 2013 13:00
Talk: 40 minutes
Questions: 5 min
2
Where is Laval University ?
In Québec City
3
Canada is in the Commonwealth of Nations too !
● Canadian money
Photo: http://www.bridgeandtunnelclub.com/bigmap/outoftown/canada/money/
4
Super computing at Laval University
colosse
#314 top500 06/2012
7616 Intel Xeon X5560 cores
Mellanox Technologies MT26428
332 kW
5
Plan
● Background
● Parallelism
● Ray & metagenomics
● Compare samples with Surveyor
● Interactive visualization
● Futures
6
Background
7
We buy sequencers and computers but...
● We have:
– DNA sequencers to read genetic code (parallel)
– Supercomputers to compute stuff in the general sense (parallel)
Mardis, E. R. (2011, February). A decade's perspective on DNA sequencing technology. Nature 470 (7333), 198-203.
Sanger, F., S. Nicklen, and A. R. Coulson (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74 (12), 5463-5467.
Shendure, J. and H. Ji (2008, October). Next-generation DNA sequencing. Nature Biotechnology 26 (10), 1135-1145.
Sanger, F. (2001, March). The early days of DNA sequences. Nat Med 7 (3), 267-268.
Afuah, A. N. and J. M. Utterback (1991, December). The emergence of a new supercomputer architecture. Technological Forecasting and Social Change 40 (4), 315-328.
8
Trend
● However:
– Genomics needs more parallel software that scales with biology's huge problems
Pollack, A. (2011). DNA sequencing caught in deluge of data. New York Times 1.
Baker, M. (2010, July). Next-generation sequencing: adjusting to data overload. Nature Methods 7 (7), 495-499.
Trelles, O., P. Prins, M. Snir, and R. C. Jansen (2011, February). Big data, but are we ready? Nature Reviews Genetics 12 (3), 224.
(2013, October). In need of an upgrade. Nature Biotechnology 31 (10), 857.
McPherson, J. D. (2009, November). Next-generation gap. Nature Methods 6 (11 Suppl), S2-S5.
Mardis, E. (2010). The $1,000 genome, the $100,000 analysis? Genome Medicine 2 (11), 84+.
9
I created some useful software
● Ray: genome assembly, metagenome assembly, taxonomic profiling, sample comparison
● RayPlatform: platform on which Ray is built
● Ray Cloud Browser: visualization of large genome graphs
10
In this talk
● Ray (C++, started with bacterial genome assembly)
● Ray Meta (assembling metagenomes with Ray)
● Ray Communities (profiling metagenomes with Ray)
● Ray Surveyor (comparing DNA sequencing samples without reference; Ray -run-surveyor)
● Ray Cloud Browser (separate project)
11
Our original idea in 2010
● Mixing reads from different technologies (454 + Illumina)
● 2010 paper about Ray heuristics:
Boisvert, S., F. Laviolette, and J. Corbeil (2010, November). Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology 17 (11), 1519-1533.
12
Mixing sequencing reads
Figure from: Journal of Computational Biology 17 (11), 1519-1533.
13
Platform
● Goal: build a platform for distributed genomic computing
● Thread-based programming is hard
● Message passing is easy to understand and scales, but is harder to program
● Solution: framework to abstract everything
14
Platform perks
● Plugin interface
● Actor model interface
● Runtimes:
– Actor playground
– Standard mode
– Mini-ranks
15
RayPlatform's scalability
● Ray's scalability is measurable
Sample SRS011098 from Human Microbiome Project (202 487 723 reads)
Figure from:
Godzaridis, Boisvert, et al. Big Data (accepted)
16
Parallelism
17
Software should be parallel too
● Highly parallel genomic assays
Nature Reviews Genetics 7, 632-644 (August 2006)
● Couple of reviews about need for speed
Flicek, P. (2009, March). The need for speed. Genome biology 10 (3), 1-4.
Bonetta, L. (2006, February). Genome sequencing in the fast lane. Nature Methods 3 (2), 141-147.
Schatz, M. C., B. Langmead, and S. L. Salzberg (2010, July). Cloud computing and the DNA data race. Nature Biotechnology 28 (7), 691-693.
18
What is concurrency
● Several actions performed simultaneously during a period of time
● Example: give 1000000 sequences to 10 computers: each processes 100000 seq. simultaneously
● Threads are local to 1 computer
● Processes can be distributed
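The example above can be sketched in a few lines. This is a minimal illustration, assuming Python's standard concurrent.futures module; the function names (split_evenly, count_gc, parallel_gc) and the GC-counting task are hypothetical stand-ins, not part of Ray. Worker processes on one machine play the role of the 10 computers.

```python
from concurrent.futures import ProcessPoolExecutor


def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_gc(chunk):
    """Work done by one worker: count G and C bases in its chunk."""
    return sum(seq.count("G") + seq.count("C") for seq in chunk)


def parallel_gc(sequences, workers=10):
    """Give each worker process one chunk and add up the partial results."""
    chunks = split_evenly(sequences, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_gc, chunks))
```

With 1000000 sequences and workers=10, split_evenly yields 10 chunks of 100000 sequences each. parallel_gc should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.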
19
Actor model for programming genomic tools
● In a nutshell: actors send messages to each other and can spawn actors
● Video: http://channel9.msdn.com/Shows/Going+Deep/Hewitt-Meijer-and-Szyperski-The-Actor-Model-everything-you-wanted-to-know-but-were-afraid-to-ask
Hewitt, C., P. Bishop, and R. Steiger (1973). A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd international joint conference on Artificial intelligence, IJCAI'73, San Francisco, CA, USA, pp. 235-245. Morgan Kaufmann Publishers Inc.
Agha, G. (1986). Actors: a model of concurrent computation in distributed systems. Cambridge, MA, USA: MIT Press.
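The actor idea can be illustrated with a toy, single-threaded sketch. Everything here (ActorSystem, Actor, Master, KmerCounter, the message tags) is hypothetical and is not the RayPlatform API; a real runtime would deliver messages between processes rather than through one in-memory queue.

```python
from collections import deque


class ActorSystem:
    """Toy scheduler: delivers queued messages one at a time."""

    def __init__(self):
        self.actors = {}
        self.mailbox = deque()
        self.next_id = 0

    def spawn(self, actor_class, *args):
        actor = actor_class(self, self.next_id, *args)
        self.actors[self.next_id] = actor
        self.next_id += 1
        return actor.address

    def send(self, destination, tag, payload=None, source=None):
        self.mailbox.append((destination, tag, payload, source))

    def run(self):
        while self.mailbox:
            destination, tag, payload, source = self.mailbox.popleft()
            self.actors[destination].receive(tag, payload, source)


class Actor:
    def __init__(self, system, address):
        self.system = system
        self.address = address

    def send(self, destination, tag, payload=None):
        self.system.send(destination, tag, payload, source=self.address)


class KmerCounter(Actor):
    """Worker: counts the k-mers of one sequence and replies to the sender."""

    def receive(self, tag, payload, source):
        if tag == "COUNT":
            sequence, k = payload
            self.send(source, "COUNT_REPLY", max(0, len(sequence) - k + 1))


class Master(Actor):
    """Spawns one worker per sequence and sums the replies."""

    def __init__(self, system, address, sequences, k):
        super().__init__(system, address)
        self.sequences, self.k = sequences, k
        self.pending, self.total = 0, 0

    def receive(self, tag, payload, source):
        if tag == "START":
            for sequence in self.sequences:
                worker = self.system.spawn(KmerCounter)
                self.send(worker, "COUNT", (sequence, self.k))
                self.pending += 1
        elif tag == "COUNT_REPLY":
            self.total += payload
            self.pending -= 1


system = ActorSystem()
master = system.spawn(Master, ["ACGTAC", "GGG", "AC"], 3)
system.send(master, "START")
system.run()
print(system.actors[master].total)
```

Actors share no state: the master only learns the counts through COUNT_REPLY messages, which is what lets the same design be spread over many machines.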
21
Ray & metagenomics
22
Metagenomics (started in 1998)
● DNA sequencing is cheap
● Bacteria in complex communities cannot be cultured easily
● Metagenomics: direct DNA sequencing from uncultured microorganisms
● Field started by Jo Handelsman in 1998
Handelsman, J. (2004, December). Metagenomics: Application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews 68 (4), 669-685.
The microbiome explored: recent insights and future challenges. Blaser, Bork, Fraser, Knight & Wang Nature Reviews Microbiology 11, 213-217 (March 2013)
Handelsman et al. (Oct 1998) Chemistry & biology 5 (10).
23
Existing metagenomic tools do ABC, we do XYZ
● Metagenomic sequencing data must be analyzed
● Methods A, B, C (16S = metagenomics)
● We propose X, Y and Z (whole genome shotgun + k-mers)
● Also, so many choices (tools, sequencers), most do ABC, we do XYZ
Loman, N. J., C. Constantinidou, J. Z. Chan, M. Halachev, M. Sergeant, C. W. Penn, E. R. Robinson, and M. J. Pallen (2012, September). High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nature Reviews Microbiology 10 (9), 599-606.
Kahvejian, A., J. Quackenbush, and J. F. Thompson (2008, October). What would you do if you could sequence everything? Nature Biotechnology 26 (10), 1125-1133.
Metagenomics: DNA sequencing of environmental samples Nature Reviews Genetics 6, 805-814 (November 2005)
24
Some concepts
● Taxonomy: the branch of science concerned with classification, especially of organisms; systematics.
● Taxon: taxonomic group
● Taxonomic tree: a tree of taxa
● Leaf: a tree node without children
● OTU: operational taxonomic unit
25
Taxonomic profiling with kmers
● Kmers: DNA words of length k
● Given (1) a taxonomic tree and (2) data (usually reads or kmers) on the tree's leaves
● LCA: Last Common Ancestor to classify each kmer to a node (possibly not a leaf)
● Colored = labeled with a taxon or genome identifier
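The classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the Ray Communities implementation: the tree is a hypothetical child-to-parent dictionary, the leaf sequences are toy strings, and reverse complements are ignored. Each k-mer is assigned to the common ancestor of all leaves that contain it, so a k-mer seen in a single leaf stays on that leaf.

```python
def kmers(sequence, k):
    """Set of all DNA words of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def path_to_root(node, parent):
    """Nodes from `node` up to the root, following child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def common_ancestor(a, b, parent):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes are not in the same tree")


def classify_kmers(leaf_sequences, parent, k):
    """Map each k-mer to the common ancestor of the leaves containing it."""
    assignment = {}
    for leaf, sequence in leaf_sequences.items():
        for kmer in kmers(sequence, k):
            if kmer in assignment:
                assignment[kmer] = common_ancestor(assignment[kmer], leaf, parent)
            else:
                assignment[kmer] = leaf
    return assignment


# Hypothetical toy tree: root -> genusA -> (sp1, sp2), root -> sp3
parent = {"sp1": "genusA", "sp2": "genusA", "genusA": "root", "sp3": "root"}
leaves = {"sp1": "AAAC", "sp2": "AACG", "sp3": "ACGT"}
labels = classify_kmers(leaves, parent, 3)
```

Here AAA is found only in sp1 and stays on that leaf, AAC is shared by sp1 and sp2 and moves up to genusA, and ACG is shared by sp2 and sp3 and moves up to the root.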
26
Examples
● Annotation with k-mers: Edwards, R. A., R. Olson, T. Disz, G. D. Pusch, V. Vonstein, R. Stevens, and R. Overbeek (2012, December). Real time metagenomics: using k-mers to annotate metagenomes. Bioinformatics (Oxford, England) 28 (24), 3316-3317.
● “Ray Communities” => Boisvert et al. 2012 Genome Biology
● Scalable taxonomic assignation: Ames, S. K., D. A. Hysom, S. N. Gardner, G. S. Lloyd, M. B. Gokhale, and J. E. Allen (2013, September). Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29 (18), 2253-2260.
27
Profile with kmers using Ray Communities
● Genome abundance
● Taxon abundance (good correlation with Metaphlan)
● Gene Ontology
28
UniFrac is mathematically sound
● Use taxon profiles
● UniFrac: distance between 2 community samples
Lozupone, C. and R. Knight (2005, December). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71 (12), 8228-8235.
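As a rough illustration of the unweighted form of the distance: the fraction of branch length that leads to taxa from only one of the two samples. The tree encoding (a child-to-parent dictionary plus one length per branch, keyed by the child node) and the function names are assumptions made for this sketch; it is not code from Ray or from the UniFrac authors.

```python
def covered_branches(taxa, parent):
    """Branches (named by their child node) between observed taxa and the root."""
    branches = set()
    for node in taxa:
        # Stop early once a branch is known: its ancestors are already covered.
        while node in parent and node not in branches:
            branches.add(node)
            node = parent[node]
    return branches


def unweighted_unifrac(taxa_a, taxa_b, parent, length):
    """Unique branch length divided by total branch length covered by a or b."""
    a = covered_branches(taxa_a, parent)
    b = covered_branches(taxa_b, parent)
    total = sum(length[n] for n in a | b)
    unique = sum(length[n] for n in a ^ b)
    return unique / total if total else 0.0


# Hypothetical toy tree with unit branch lengths.
parent = {"sp1": "genusA", "sp2": "genusA", "genusA": "root", "sp3": "root"}
length = {"sp1": 1.0, "sp2": 1.0, "genusA": 1.0, "sp3": 1.0}
d = unweighted_unifrac({"sp1", "sp2"}, {"sp2", "sp3"}, parent, length)
```

Two identical samples give 0 and two samples with no shared branch give 1. In the toy example, the branches to sp1 and sp3 are unique and the branches to sp2 and genusA are shared, so d is 2 / 4 = 0.5.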
29
Ray Meta
● “Ray Meta” => metagenome assembly with Ray
● Binning with coverage may not be accurate because coverage depth changes with GC content and other factors
● Ray trick: instead of binning with coverage, bin with graph seeds (locality)
Boisvert, S., F. Raymond, E. Godzaridis, F. Laviolette, and J. Corbeil (2012, December). Ray meta: scalable de novo metagenome assembly and profiling. Genome Biology 13 (12), R122+.
● http://genomebiology.com/2012/13/12/R122
30
Assembled proportions of bacterial genomes for a simulated metagenome
with sequencing errors
1000 bacterial genomes with power law distribution
3*10^9 reads
Simulated errors
Figure 1, Boisvert et al. 2012 Genome Biology
Good assembly proportion of contained genomes within metagenome
31
Estimated bacterial genome proportions
● With kmer
● Uniquely-colored k-mers
A: 100-genome metagenome
B: 1000-genome metagenome
Figure 2, Boisvert et al. 2012 Genome Biology
32
Enterotypes
● 3 enterotypes: Arumugam, M. (...) and P. Bork (2011, April). Enterotypes of the human gut microbiome. Nature 473 (7346), 174-180.
● 2 enterotypes: Wu, G. D. (...) and J. D. Lewis (2011, October). Linking long-term dietary patterns with gut microbial enterotypes. Science (New York, N.Y.) 334 (6052), 105-108.
● Can we reproduce that with k-mer-based classification?
33
Reproduction of enterotypes with k-mer based profiling
● Data: Qin et al. 2010 Nature (MetaHIT)
Figure 4, Boisvert et al. 2012 Genome Biology
34
Some quotes
● Snake assembly in Assemblathon 2:
“The Ray assembly was ranked 1st overall, and also ranked 1st for all individual measures except multiplicity (where it still had a better than average performance).” GigaScience 2013, 2:10
● E. coli sequencing on MiSeq:
“Ray stood apart as the most accurate of the three assemblers, based on the number of inversions, relocations, SNPs, and a visual inspection of the associated dot plots” BMC Genomics 2013, 14:675
● “Ray will be a good validation assembler” Bastien Chevreux (Mira assembler author) http://article.gmane.org/gmane.science.biology.ray-genome-assembler/696
35
Compare samples with Surveyor
36
Using a graph to mine variation
Bubble caused by variation or sequencing error
37
Comparing metagenome samples
● Idea: compare samples without a reference
● Be it variants, or kmer content
● For kmer presence/absence, don't use coverage
● For RNA-Seq or taxon abundances, compare normalized kmer counts
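The two kinds of comparison can be sketched as follows. This is an illustration only: the Jaccard index is one possible presence/absence measure chosen for this sketch, not necessarily what Ray Surveyor computes, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads of a sample."""
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def shared_fraction(counts_a, counts_b):
    """Presence/absence comparison: Jaccard index of the two k-mer sets."""
    a, b = set(counts_a), set(counts_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def normalized(counts):
    """Divide each count by the sample total so sequencing depth cancels out."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


deep = kmer_counts(["ACGT", "ACGT"], 2)   # same content sequenced twice
shallow = kmer_counts(["ACGT"], 2)
other = kmer_counts(["ACGA"], 2)
```

The deep and shallow samples contain the same k-mers, so shared_fraction(deep, shallow) is 1.0 regardless of coverage, and their normalized counts agree even though the raw counts differ by a factor of two.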
38
Compare genomic content without a ref. with Surveyor
● Set of biological samples
● DNA sequencing for each
● Use Actor Model to compare a lot of samples
● Build a de Bruijn graph that contains all of them (à la fermi or Cortex), but distributed
● In development
Iqbal, Z., I. Turner, and G. McVean (2013, January). High-throughput microbial population genomics using the cortex variation assembler. Bioinformatics 29 (2), 275-276. Cortex for microbial populations
Iqbal, Z., M. Caccamo, I. Turner, P. Flicek, and G. McVean (2012, February). De novo assembly and genotyping of variants using colored de bruijn graphs. Nature Genetics 44 (2), 226-232. Cortex
Li, H. (2012, July). Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28 (14), 1838-1844. Fermi
39
Ray -run-surveyor
● Existing methods enumerate variation entries
● Genomic word content may also be interesting
● Compare many samples (their kmer content)
40
Legionella
● 2012 outbreak in Quebec City
● What's the source of contamination ?
● 3 suspect cooling towers
● On the Illumina MiSeq
41
Samples
● 22 patient-samples
● 3 source-tower-samples (metagenomic)
● 2 epidemic-strain-environmental-samples
● 7 environmental-samples
● 4 contemporaneous-samples
● 5 old-1996-samples
42
Questions
● Are the 2012 strains similar to the 1996 (also in Québec City) strains ?
● Which cooling tower is the most-likely source of contamination ?
43
Similarity matrix (k spectrum kernel)
Ref. for spectrum kernel: Leslie, C., E. Eskin, and W. S. Noble (2002). The spectrum kernel: a string kernel for SVM protein classification. Pacific Symposium on Biocomputing, 564-575.
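A minimal sketch of a k-spectrum kernel as defined by Leslie et al.: the inner product of the k-mer count vectors of two sequences. Whether Ray Surveyor uses counts or presence/absence in this matrix is not stated on the slide, so the use of counts, like the function names and toy samples, is an assumption of this example.

```python
from collections import Counter


def spectrum(sequence, k):
    """k-spectrum: how many times each k-mer occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the two k-spectra."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())


def similarity_matrix(samples, k):
    """Kernel value for every pair of named samples."""
    names = sorted(samples)
    return {a: {b: spectrum_kernel(samples[a], samples[b], k) for b in names}
            for a in names}


# Hypothetical toy samples.
samples = {"s1": "ACGT", "s2": "ACGA", "s3": "TTTT"}
matrix = similarity_matrix(samples, 2)
```

Samples s1 and s2 share the 2-mers AC and CG, so their kernel value is 2, while s3 shares nothing with either and scores 0.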
44
Kernel-based distance matrix
For kernel distance formula: Scholkopf, B. (2000). The kernel trick for distances. In NIPS, pp. 301-307.
d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y)
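The formula translates directly into code. A small sketch, assuming the kernel matrix is a symmetric list of lists; the helper names and the toy vectors are hypothetical. With a plain dot product as the kernel, the result is the ordinary Euclidean distance, which makes the formula easy to check by hand.

```python
import math


def kernel_distance(k_xx, k_yy, k_xy):
    """d(x, y) from three kernel values, clamping tiny negative round-off."""
    return math.sqrt(max(k_xx + k_yy - 2 * k_xy, 0.0))


def distance_matrix(kernel):
    """Turn a symmetric kernel matrix (list of lists) into a distance matrix."""
    n = len(kernel)
    return [[kernel_distance(kernel[i][i], kernel[j][j], kernel[i][j])
             for j in range(n)] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vectors standing in for k-mer count vectors.
vectors = [(3, 0), (0, 4)]
kernel = [[dot(u, v) for v in vectors] for u in vectors]
distances = distance_matrix(kernel)
```

For these vectors k(x, x) = 9, k(y, y) = 16 and k(x, y) = 0, so d(x, y)^2 = 25 and d(x, y) = 5, the length of the hypotenuse of a 3-4-5 triangle.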
45
Tree
Towers are outliers and their placement may not be accurate.
46
Similarity between patient samples & tower samples
sample          towers/002-1  towers/006-1  towers/010-1
pat/ID120206    11187         12528         11329
pat/ID120368    11168         12513         11315
pat/ID120369    11282         12617         11427
pat/ID120370    11272         12613         11421
pat/ID120371    11289         12621         11434
pat/ID120713    11225         12566         11368
pat/KID119442   11092         12445         11239
pat/KID119444   11097         12449         11244
pat/KID119445   11117         12468         11261
pat/KID119536   11138         12488         11287
pat/KID119537   11175         12518         11321
pat/KID119788   11193         12536         11336
pat/KID119957   11092         12445         11239
pat/KID119958   11144         12494         11292
pat/KID119960   11265         12602         11408
pat/KID120069   11089         12442         11236
pat/KID120070   11154         12501         11299
pat/KID120071   11116         12467         11261
pat/KID120111   11219         12559         11365
pat/KID120112   11172         12518         11319
pat/KID120113   11357         12686         11497
pat/KID120114   11235         12577         11381
Smallest distance
47
Interactive visualization
48
Visualizing a microbiota with nucleic acid probes
Figure 2, Handelsman (2004) Microbiology and Molecular Biology Reviews 68 (4), 669-685.
49
Observation
● Visualization is important to reach out to the general public
● People like beautiful things
50
Structural metagenomics visualization
● Ray Cloud Browser
● Project started to debug genome assembly code
● http://genome.ulaval.ca:10208/client/
● All you need is a modern web browser
51
Ray Cloud Browser: interactively skim processed genomics data with energy
Frontend: Javascript, canvas
Backend: C++
https://github.com/sebhtml/Ray-Cloud-Browser
52
Computing DNA layout for display
Barnes-Hut algorithm: Barnes, J. and P. Hut (1986, December). A hierarchical O(N log N) force-calculation algorithm. Nature 324 (6096), 446-449.
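The idea behind Barnes-Hut can be sketched with a small quadtree: distant groups of points are replaced by their centre of mass, so each force evaluation visits far fewer than N nodes. This is not the Ray Cloud Browser code. The unit masses, the inverse-square repulsion, the names (Node, build_tree, repulsion) and the opening parameter theta are assumptions of this sketch, and all points are assumed to be distinct.

```python
import math


class Node:
    """Square region storing total mass and mass-weighted coordinate sums."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass, self.mx, self.my = 0.0, 0.0, 0.0
        self.point = None      # single point held by an unsplit leaf
        self.children = None   # four sub-squares once the node is split

    def _split(self):
        h = self.half / 2
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]

    def _child_for(self, x, y):
        index = (2 if x >= self.cx else 0) + (1 if y >= self.cy else 0)
        return self.children[index]

    def insert(self, x, y):
        if self.children is None and self.point is None:
            self.point = (x, y)              # empty leaf: store the point here
        else:
            if self.children is None:        # occupied leaf: split it first
                self._split()
                old, self.point = self.point, None
                self._child_for(*old).insert(*old)
            self._child_for(x, y).insert(x, y)
        self.mass += 1.0
        self.mx += x
        self.my += y


def build_tree(points):
    """Build a quadtree whose root square encloses all points."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 + 1e-9
    root = Node(cx, cy, half)
    for x, y in points:
        root.insert(x, y)
    return root


def repulsion(node, x, y, theta=0.5):
    """Approximate repulsive force on (x, y) from all points under `node`."""
    if node.mass == 0:
        return 0.0, 0.0
    dx, dy = x - node.mx / node.mass, y - node.my / node.mass
    dist = math.hypot(dx, dy)
    # Use the centre of mass for leaves and for squares that look small from here.
    if node.children is None or (dist > 0 and 2 * node.half / dist < theta):
        if dist == 0:                        # the query point itself
            return 0.0, 0.0
        f = node.mass / dist ** 2            # inverse-square magnitude
        return f * dx / dist, f * dy / dist
    fx = fy = 0.0
    for child in node.children:
        cfx, cfy = repulsion(child, x, y, theta)
        fx, fy = fx + cfx, fy + cfy
    return fx, fy
```

With theta=0 no square is ever approximated and the result equals the exact pairwise sum; larger values of theta trade accuracy for speed.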
53
Evolution path: linear -> bubble -> hairy bubble -> super bubble
Onodera, T., K. Sadakane, and T. Shibuya (2013). Detecting superbubbles in assembly graphs. In A. Darling and J. Stoye (Eds.), Algorithms in Bioinformatics, Volume 8126 of Lecture Notes in Computer Science, pp. 338-348. Springer Berlin Heidelberg.
Hairy bubbles
54
Interactive too
55
Bird's view
56
Lumps
Howe, A. C., J. Pell, R. Canino-Koning, R. Mackelprang, S. Tringe, J. Jansson, J. M. Tiedje, and C. T. Brown (2012, December). Illumina sequencing artifacts revealed by connectivity analysis of metagenomic datasets.
http://dskernel.blogspot.ca/2013/01/metagenome-lumps-artifactual-mutations.html
58
Lumps
59
Lumps
60
SRS011134
● Demo (2 min): http://genome.ulaval.ca:10208/client/
● Genomic DNA from stool of a male
● http://sra.dnanexus.com/samples/SRS011134
62
Futures
● Genomics needs more scalable & parallel software
● More parallel
● More push-button
● Robustness
● K-mer-based (paper: realtime kmers)
64
Acknowledgements
● Invitation: Nicholas J. Loman, University of Birmingham
● Arrangements: Lesley Parsons, University of Liverpool
65
Acknowledgements
● Funding: Canadian Institutes of Health Research (doctoral award)
● Compute time: Compute Canada & Calcul Québec (colosse and Mammouth Parallèle II)
● Jacques Corbeil (director) & François Laviolette (codirector)
66
Acknowledgements
● Jean-François Erdelyi (from France) for working on Ray Cloud Browser during the 2013 summer
67
Acknowledgements
● E. Godzaridis for comments and suggestions on my talk
68
Questions
● Don't forget to tweet!
● @sebhtml
● #BeatlesAndBioinformatics