1

Ray and Ray Cloud Browser for Metagenomics

Sébastien Boisvert @sebhtml
Université Laval, Québec, Canada

Beatles and Bioinformatics! #BeatlesAndBioinformatics
University of Liverpool

27th November 2013 13:00

Talk: 40 minutes
Questions: 5 minutes

2

Where is Laval University?

In Québec City

3

Canada is in the Commonwealth of Nations too!

● Canadian money

Photo: http://www.bridgeandtunnelclub.com/bigmap/outoftown/canada/money/

4

Super computing at Laval University

colosse
#314 top500 06/2012
7616 Intel Xeon X5560 cores
Mellanox Technologies MT26428
332 kW

5

Plan

● Background
● Parallelism
● Ray & metagenomics
● Compare samples with Surveyor
● Interactive visualization
● Futures

6

Background

7

We buy sequencers and computers but...

● We have:
– DNA sequencers to read genetic code (parallel)
– Supercomputers to compute stuff in the general sense (parallel)

Mardis, E. R. (2011, February). A decade's perspective on DNA sequencing technology. Nature 470 (7333), 198-203.

Sanger, F., S. Nicklen, and A. R. Coulson (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74 (12), 5463-5467.

Shendure, J. and H. Ji (2008, October). Next-generation DNA sequencing. Nature Biotechnology 26 (10), 1135-1145.

Sanger, F. (2001, March). The early days of DNA sequences. Nat Med 7 (3), 267-268.

Afuah, A. N. and J. M. Utterback (1991, December). The emergence of a new supercomputer architecture. Technological Forecasting and Social Change 40 (4), 315-328.

8

Trend

● However:
– Genomics needs more parallel software that scales with biology's huge problems

Pollack, A. (2011). DNA sequencing caught in deluge of data. New York Times 1.

Baker, M. (2010, July). Next-generation sequencing: adjusting to data overload. Nature Methods 7 (7), 495-499.

Trelles, O., P. Prins, M. Snir, and R. C. Jansen (2011, February). Big data, but are we ready? Nature Reviews Genetics 12 (3), 224.

(2013, October). In need of an upgrade. Nature Biotechnology 31 (10), 857.

McPherson, J. D. (2009, November). Next-generation gap. Nature Methods 6 (11 Suppl), S2-S5.

Mardis, E. (2010). The $1,000 genome, the $100,000 analysis? Genome Medicine 2 (11), 84+.

9

I created some useful software

● Ray: genome assembly, metagenome assembly, taxonomic profiling, sample comparison
● RayPlatform: platform on which Ray is built
● Ray Cloud Browser: visualization of large genome graphs

10

In this talk

● Ray (C++, started with bacterial genome assembly)
● Ray Meta (assembling metagenomes with Ray)
● Ray Communities (profiling metagenomes with Ray)
● Ray Surveyor (comparing DNA sequencing samples without reference; Ray -run-surveyor)
● Ray Cloud Browser (separate project)

11

Our original idea in 2010

● Mixing reads from different technologies (454 + Illumina)

● 2010 paper about Ray heuristics:

Boisvert, S., F. Laviolette, and J. Corbeil (2010, November). Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology 17 (11), 1519-1533.

12

Mixing sequencing reads

Figure from: Journal of Computational Biology 17 (11), 1519-1533.

13

Platform

● Goal: build a platform for distributed genomic computing

● Thread-based programming is hard
● Message passing is easy to understand and scales, but is harder to program
● Solution: framework to abstract everything

14

Platform perks

● Plugin interface
● Actor model interface
● Runtimes:
– Actor playground
– Standard mode
– Mini-ranks

15

RayPlatform's scalability

● Ray's scalability is measurable

Sample SRS011098 from Human Microbiome Project (202 487 723 reads)

Figure from:

Godzaridis, Boisvert, et al. Big Data (accepted)

16

Parallelism

17

Software should be parallel too

● Highly parallel genomic assays

Nature Reviews Genetics 7, 632-644 (August 2006)

● Couple of reviews about need for speed

Flicek, P. (2009, March). The need for speed. Genome biology 10 (3), 1-4.

Bonetta, L. (2006, February). Genome sequencing in the fast lane. Nature Methods 3 (2), 141-147.

Schatz, M. C., B. Langmead, and S. L. Salzberg (2010, July). Cloud computing and the DNA data race. Nature Biotechnology 28 (7), 691-693.

18

What is concurrency

● Several actions performed simultaneously during a period of time

● Example: give 1000000 sequences to 10 computers: each processes 100000 seq. simultaneously

● Threads are local to 1 computer
● Processes can be distributed
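The example above (1,000,000 sequences split across 10 workers) can be sketched as follows. This is a minimal illustration in Python with local worker processes, not how Ray distributes work; the function names `split_evenly`, `gc_count` and `parallel_gc` are hypothetical, and counting G/C bases is only a stand-in for real per-sequence work.

```python
from multiprocessing import Pool


def gc_count(chunk):
    """Count G and C bases in one chunk of sequences (stand-in for real work)."""
    return sum(seq.count("G") + seq.count("C") for seq in chunk)


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def parallel_gc(sequences, workers):
    """Give each worker process one chunk, then add up the partial results."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(gc_count, split_evenly(sequences, workers)))
```

With 1,000,000 sequences and 10 workers, `split_evenly` yields 10 chunks of 100,000 sequences each. `parallel_gc` should be called from under an `if __name__ == "__main__":` guard so that worker processes can start cleanly on every platform.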

19

Actor model for programming genomic tools

● In a nutshell: actors send messages to each other and can spawn actors

● Video: http://channel9.msdn.com/Shows/Going+Deep/Hewitt-Meijer-and-Szyperski-The-Actor-Model-everything-you-wanted-to-know-but-were-afraid-to-ask

Hewitt, C., P. Bishop, and R. Steiger (1973). A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd international joint conference on Artificial intelligence, IJCAI'73, San Francisco, CA, USA, pp. 235-245. Morgan Kaufmann Publishers Inc.

Agha, G. (1986). Actors: a model of concurrent computation in distributed systems. Cambridge, MA, USA: MIT Press.
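To make the "actors send messages and can spawn actors" idea concrete, here is a toy single-process sketch in Python. It is not the RayPlatform API: `ActorSystem`, `Master`, `Worker` and the message tuples are all hypothetical names chosen for illustration, and a real runtime would deliver messages between processes rather than through one in-memory queue.

```python
from collections import deque


class ActorSystem:
    """Toy runtime: one shared mailbox, messages delivered in FIFO order."""

    def __init__(self):
        self.actors = {}
        self.mailbox = deque()
        self.next_id = 0

    def spawn(self, actor):
        actor_id = self.next_id
        self.next_id += 1
        self.actors[actor_id] = actor
        return actor_id

    def send(self, target, message):
        self.mailbox.append((target, message))

    def run(self):
        while self.mailbox:
            target, message = self.mailbox.popleft()
            self.actors[target].receive(self, target, message)


class Worker:
    """Counts G/C bases in one sequence and replies to the sender."""

    def receive(self, system, self_id, message):
        kind, seq, reply_to = message
        if kind == "count_gc":
            system.send(reply_to, ("result", seq.count("G") + seq.count("C")))


class Master:
    """Spawns one worker per sequence and sums the replies."""

    def __init__(self):
        self.total = 0

    def receive(self, system, self_id, message):
        if message[0] == "start":
            for seq in message[1]:
                worker = system.spawn(Worker())
                system.send(worker, ("count_gc", seq, self_id))
        elif message[0] == "result":
            self.total += message[1]


def run_demo(sequences):
    system = ActorSystem()
    master = Master()
    master_id = system.spawn(master)
    system.send(master_id, ("start", sequences))
    system.run()
    return master.total
```

Each actor only touches its own state and reacts to one message at a time, which is what makes the model easier to reason about than shared-memory threads.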

21

Ray & metagenomics

22

Metagenomics (started in 1998)

● DNA sequencing is cheap
● Bacteria in complex communities cannot be cultured easily
● Metagenomics: direct DNA sequencing from uncultured microorganisms
● Field started by Jo Handelsman in 1998

Handelsman, J. (2004, December). Metagenomics: Application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews 68 (4), 669-685.

The microbiome explored: recent insights and future challenges. Blaser, Bork, Fraser, Knight & Wang Nature Reviews Microbiology 11, 213-217 (March 2013)

Handelsman et al. (Oct 1998) Chemistry & biology 5 (10).

23

Existing metagenomic tools do ABC, we do XYZ

● Metagenomic sequencing data must be analyzed
● Methods A, B, C (16S = metagenomics)
● We propose X, Y and Z (whole genome shotgun + k-mers)

● Also, so many choices (tools, sequencers), most do ABC, we do XYZ

Loman, N. J., C. Constantinidou, J. Z. Chan, M. Halachev, M. Sergeant, C. W. Penn, E. R. Robinson, and M. J. Pallen (2012, September). High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nature Reviews Microbiology 10 (9), 599-606.

Kahvejian, A., J. Quackenbush, and J. F. Thompson (2008, October). What would you do if you could sequence everything? Nature Biotechnology 26 (10), 1125-1133.

Metagenomics: DNA sequencing of environmental samples Nature Reviews Genetics 6, 805-814 (November 2005)

24

Some concepts

● Taxonomy: the branch of science concerned with classification, especially of organisms; systematics.

● Taxon: taxonomic group
● Taxonomic tree: a tree of taxa
● Leaf: a tree node without children
● OTU: operational taxonomic unit

25

Taxonomic profiling with kmers

● Kmers: DNA words of length k
● Given (1) a taxonomic tree and (2) data (usually reads or kmers) on the tree's leaves
● LCA: Last Common Ancestor to classify each kmer to a node (possibly not a leaf)
● Colored = labeled with a taxon or genome identifier
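The coloring step above can be sketched in Python as follows. This is a simplified illustration, not the Ray Communities implementation: the tree is a hypothetical `parent` dictionary, the genome strings and taxon names are made up, and reverse-complement k-mers are ignored.

```python
def kmers(sequence, k):
    """Set of all DNA words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def ancestors(node, parent):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Deepest node that is an ancestor of both a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def color_kmers(genomes, parent, k):
    """Map each k-mer to the LCA of all leaves whose genome contains it."""
    colors = {}
    for leaf, sequence in genomes.items():
        for kmer in kmers(sequence, k):
            colors[kmer] = lca(colors[kmer], leaf, parent) if kmer in colors else leaf
    return colors


# Hypothetical tree: root -> genusA -> {sp1, sp2}, and root -> sp3.
parent = {"sp1": "genusA", "sp2": "genusA", "genusA": "root", "sp3": "root"}
genomes = {"sp1": "ACGTA", "sp2": "ACGTT", "sp3": "CGTTT"}
colors = color_kmers(genomes, parent, 3)
```

Here `GTA` occurs only in `sp1` and stays on that leaf (a uniquely-colored k-mer), `ACG` is shared by `sp1` and `sp2` and moves up to `genusA`, and `CGT` occurs in all three genomes and ends at `root`.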

26

Examples

● Annotation with k-mers: Edwards, R. A., R. Olson, T. Disz, G. D. Pusch, V. Vonstein, R. Stevens, and R. Overbeek (2012, December). Real time metagenomics: using k-mers to annotate metagenomes. Bioinformatics (Oxford, England) 28 (24), 3316-3317.

● “Ray Communities” => Boisvert et al. 2012 Genome Biology
● Scalable taxonomic assignment: Ames, S. K., D. A. Hysom, S. N. Gardner, G. S. Lloyd, M. B. Gokhale, and J. E. Allen (2013, September). Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29 (18), 2253-2260.

27

Profile with kmers using Ray Communities

● Genome abundance
● Taxon abundance (good correlation with Metaphlan)
● Gene Ontology

28

UniFrac is mathematically sound

● Use taxon profiles
● UniFrac: distance between 2 community samples

Lozupone, C. and R. Knight (2005, December). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71 (12), 8228-8235.
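As a rough illustration of the distance, the unweighted form of UniFrac is the fraction of the tree's branch length that leads to taxa from only one of the two samples. The sketch below assumes a hypothetical tree given as `parent` and `length` dictionaries (each branch is identified by its child node) and samples given as sets of leaf names; it is not taken from any particular UniFrac implementation.

```python
def unweighted_unifrac(parent, length, sample_a, sample_b):
    """Fraction of covered branch length that is unique to one sample."""

    def covered_branches(sample):
        covered = set()
        for leaf in sample:
            node = leaf
            while node in parent:  # the root has no branch above it
                covered.add(node)  # a branch is named after its child node
                node = parent[node]
        return covered

    a = covered_branches(sample_a)
    b = covered_branches(sample_b)
    total = sum(length[n] for n in a | b)
    unique = sum(length[n] for n in a ^ b)
    return unique / total


# Hypothetical tree: root -> genusA -> {sp1, sp2}, and root -> sp3.
parent = {"sp1": "genusA", "sp2": "genusA", "genusA": "root", "sp3": "root"}
length = {"sp1": 1.0, "sp2": 1.0, "genusA": 2.0, "sp3": 3.0}
d = unweighted_unifrac(parent, length, {"sp1", "sp2"}, {"sp2", "sp3"})
```

In this toy tree the branches to `sp1` and `sp3` (length 1 + 3) are unique to one sample out of a total covered length of 7, so `d` is 4/7. Identical samples give 0 and samples sharing no branch give 1. The function assumes at least one sample is non-empty.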

29

Ray Meta

● “Ray Meta” => metagenome assembly with Ray
● Binning with coverage may not be accurate because coverage depth changes with GC content and other factors

● Ray trick: instead of binning with coverage, bin with graph seeds (locality)

Boisvert, S., F. Raymond, E. Godzaridis, F. Laviolette, and J. Corbeil (2012, December). Ray meta: scalable de novo metagenome assembly and profiling. Genome Biology 13 (12), R122+.

● http://genomebiology.com/2012/13/12/R122

30

Assembled proportions of bacterial genomes for a simulated metagenome with sequencing errors

1000 bacterial genomes with power law distribution
3*10^9 reads
Simulated errors
Figure 1, Boisvert et al. 2012 Genome Biology

A good proportion of each genome contained in the metagenome is assembled

31

Estimated bacterial genome proportions

● With k-mers
● Uniquely-colored k-mers

A: 100-genome metagenome

B: 1000-genome metagenome

Figure 2, Boisvert et al. 2012 Genome Biology

32

Enterotypes

● 3 enterotypes: Arumugam, M. (...) and P. Bork (2011, April). Enterotypes of the human gut microbiome. Nature 473 (7346), 174-180.

● 2 enterotypes: Wu, G. D. (...) and J. D. Lewis (2011, October). Linking long-term dietary patterns with gut microbial enterotypes. Science (New York, N.Y.) 334 (6052), 105-108.

● Can we reproduce that with k-mer-based classification?

33

Reproduction of enterotypes with k-mer based profiling

● Data: Qin et al. 2010 Nature (MetaHIT)

Figure 4, Boisvert et al. 2012 Genome Biology

34

Some quotes

● Snake assembly in Assemblathon 2:

“The Ray assembly was ranked 1st overall, and also ranked 1st for all individual measures except multiplicity (where it still had a better than average performance).” GigaScience 2013, 2:10

● E. coli sequencing on MiSeq:

“Ray stood apart as the most accurate of the three assemblers, based on the number of inversions, relocations, SNPs, and a visual inspection of the associated dot plots” BMC Genomics 2013, 14:675

● “Ray will be a good validation assembler” Bastien Chevreux (Mira assembler author) http://article.gmane.org/gmane.science.biology.ray-genome-assembler/696

36

Using a graph to mine variation

Bubble caused by variation or sequencing error
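A bubble can be illustrated with a tiny de Bruijn graph built from two sequences that differ by one base. The Python sketch below is a simplified detector written for this example only; it is not Ray's algorithm, and `de_bruijn` and `simple_bubbles` are hypothetical names. It only finds bubbles made of two linear branches that leave one k-mer and rejoin at another.

```python
from collections import defaultdict


def de_bruijn(sequences, k):
    """Nodes are k-mers; an edge links each k-mer to the next one in a sequence."""
    succ, pred = defaultdict(set), defaultdict(set)
    for s in sequences:
        for i in range(len(s) - k):
            a, b = s[i:i + k], s[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def simple_bubbles(succ, pred, max_steps=100):
    """Return (start, end) pairs where two linear branches diverge and rejoin."""
    bubbles = []
    for start in list(succ):
        if len(succ[start]) != 2:
            continue
        ends = []
        for node in sorted(succ[start]):
            steps = 0
            # Follow the branch while it stays linear (1 parent, 1 child).
            while len(pred[node]) == 1 and len(succ[node]) == 1 and steps < max_steps:
                node = next(iter(succ[node]))
                steps += 1
            ends.append(node)
        if ends[0] == ends[1] and len(pred[ends[0]]) >= 2:
            bubbles.append((start, ends[0]))
    return bubbles


# Two made-up sequences differing by a single base (A vs C at position 5).
succ, pred = de_bruijn(["TACGATGCA", "TACGCTGCA"], 3)
found = simple_bubbles(succ, pred)
```

With k = 3 the two sequences share the k-mers `TAC`, `ACG`, `TGC` and `GCA`; the branches `CGA-GAT-ATG` and `CGC-GCT-CTG` form the bubble, so `found` contains the single pair `("ACG", "TGC")`. Whether such a bubble reflects real variation or a sequencing error has to be decided from other evidence, such as coverage.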

37

Comparing metagenome samples

● Idea: compare samples without a reference
● Be it variants, or kmer content
● For kmer presence/absence, don't use coverage
● For RNA-Seq or taxon abundances, compare normalized kmer counts
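The two comparison modes above can be sketched in Python. This is an illustration under assumptions, not Ray Surveyor's actual metric: the Jaccard index is used here for presence/absence and an L1 distance for normalized counts, and all function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence over all reads of a sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def jaccard_presence(counts_a, counts_b):
    """Presence/absence similarity: shared k-mers over all k-mers (ignores coverage)."""
    a, b = set(counts_a), set(counts_b)
    return len(a & b) / len(a | b)


def normalized(counts):
    """Turn raw counts into frequencies so that sequencing depth cancels out."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def abundance_distance(counts_a, counts_b):
    """L1 distance between normalized k-mer profiles (0 = identical, 2 = disjoint)."""
    fa, fb = normalized(counts_a), normalized(counts_b)
    return sum(abs(fa.get(x, 0.0) - fb.get(x, 0.0)) for x in set(fa) | set(fb))


shallow = kmer_counts(["ACGT"], 2)
deep = kmer_counts(["ACGT", "ACGT", "ACGT"], 2)  # same content, 3x the depth
other = kmer_counts(["ACGA"], 2)
```

`shallow` and `deep` have the same k-mers at different depths, so their presence/absence similarity is 1 and their normalized distance is 0, while `shallow` and `other` share 2 of 4 distinct k-mers (similarity 0.5). Both functions assume non-empty samples.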

38

Compare genomic content without a ref. with Surveyor

● Set of biological samples
● DNA sequencing for each
● Use Actor Model to compare a lot of samples
● Build a de Bruijn graph that contains all of them (à la fermi or Cortex), but distributed
● In development

Iqbal, Z., I. Turner, and G. McVean (2013, January). High-throughput microbial population genomics using the cortex variation assembler. Bioinformatics 29 (2), 275-276. [Cortex for microbial populations]

Iqbal, Z., M. Caccamo, I. Turner, P. Flicek, and G. McVean (2012, February). De novo assembly and genotyping of variants using colored de bruijn graphs. Nature Genetics 44 (2), 226-232. [Cortex]

Li, H. (2012, July). Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28 (14), 1838-1844. [Fermi]

39

Ray -run-surveyor

● Existing methods enumerate variation entries
● Genomic word content may also be interesting
● Compare many samples (their kmer content)

40

Legionella

● 2012 outbreak in Quebec City
● What's the source of contamination?
● 3 suspect cooling towers
● On the Illumina MiSeq

41

Samples

● 22 patient-samples
● 3 source-tower-samples (metagenomic)
● 2 epidemic-strain-environmental-samples
● 7 environmental-samples
● 4 contemporaneous-samples
● 5 old-1996-samples

42

Questions

● Are the 2012 strains similar to the 1996 strains (also from Québec City)?

● Which cooling tower is the most likely source of contamination?

43

Similarity matrix (k spectrum kernel)

Reference for the spectrum kernel: Leslie, C., E. Eskin, and W. S. Noble (2002). The spectrum kernel: a string kernel for SVM protein classification. Pacific Symposium on Biocomputing, 564-575.

44

Kernel-based distance matrix

For kernel distance formula: Schölkopf, B. (2000). The kernel trick for distances. In NIPS, pp. 301-307.

d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y)
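The formula can be checked on a toy example. The Python sketch below assumes that each sample is represented by its k-mer count vector (its k-spectrum) and that k(x, y) is the dot product of two such vectors; whether Ray Surveyor uses counts or presence/absence is not stated here, and the function names are hypothetical.

```python
import math
from collections import Counter


def spectrum(sequences, k):
    """k-spectrum of a sample: how many times each k-mer occurs."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def spectrum_kernel(x, y):
    """k(x, y): dot product of two k-mer count vectors."""
    return sum(n * y[w] for w, n in x.items() if w in y)


def kernel_distance(x, y):
    """d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y))."""
    squared = spectrum_kernel(x, x) + spectrum_kernel(y, y) - 2 * spectrum_kernel(x, y)
    return math.sqrt(max(squared, 0))


x = spectrum(["ACGT"], 2)  # AC, CG, GT
y = spectrum(["ACGA"], 2)  # AC, CG, GA
```

Here k(x, x) = 3, k(y, y) = 3 and k(x, y) = 2, so d(x, y)^2 = 3 + 3 - 4 = 2. Computing the kernel for every pair of samples gives the similarity matrix, and applying the formula to every pair gives the distance matrix.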

45

Tree

Towers are outliers and their placement may not be accurate.

46

Similarity between patient samples & tower samples

Sample          towers/002-1  towers/006-1  towers/010-1
pat/ID120206    11187         12528         11329
pat/ID120368    11168         12513         11315
pat/ID120369    11282         12617         11427
pat/ID120370    11272         12613         11421
pat/ID120371    11289         12621         11434
pat/ID120713    11225         12566         11368
pat/KID119442   11092         12445         11239
pat/KID119444   11097         12449         11244
pat/KID119445   11117         12468         11261
pat/KID119536   11138         12488         11287
pat/KID119537   11175         12518         11321
pat/KID119788   11193         12536         11336
pat/KID119957   11092         12445         11239
pat/KID119958   11144         12494         11292
pat/KID119960   11265         12602         11408
pat/KID120069   11089         12442         11236
pat/KID120070   11154         12501         11299
pat/KID120071   11116         12467         11261
pat/KID120111   11219         12559         11365
pat/KID120112   11172         12518         11319
pat/KID120113   11357         12686         11497
pat/KID120114   11235         12577         11381

Smallest distance

47

Interactive visualization

48

Visualizing a microbiota with nucleic acid probes

Figure 2, Handelsman (2004) Microbiology and Molecular Biology Reviews 68 (4), 669-685.

49

Observation

● Visualization is important to reach out to the general public

● People like beautiful things

50

Structural metagenomics visualization

● Ray Cloud Browser
● Project started to debug genome assembly code
● http://genome.ulaval.ca:10208/client/
● All you need is a modern web browser

51

Ray Cloud Browser: interactively skim processed genomics data with energy

Frontend: Javascript, canvas

Backend: C++

https://github.com/sebhtml/Ray-Cloud-Browser

52

Computing DNA layout for display

Barnes-Hut algorithm: Barnes, J. and P. Hut (1986, December). A hierarchical O(N log N) force-calculation algorithm. Nature 324 (6096), 446-449.

53

Evolution path: linear -> bubble -> hairy bubble -> super bubble

Onodera, T., K. Sadakane, and T. Shibuya (2013). Detecting superbubbles in assembly graphs. In A. Darling and J. Stoye (Eds.), Algorithms in Bioinformatics, Volume 8126 of Lecture Notes in Computer Science, pp. 338-348. Springer Berlin Heidelberg.

Hairy bubbles

54

Interactive too

55

Bird's view

56

Lumps

Howe, A. C., J. Pell, R. Canino-Koning, R. Mackelprang, S. Tringe, J. Jansson, J. M. Tiedje, and C. T. Brown (2012, December). Illumina sequencing artifacts revealed by connectivity analysis of metagenomic datasets.

http://dskernel.blogspot.ca/2013/01/metagenome-lumps-artifactual-mutations.html

58

Lumps

59

Lumps

60

SRS011134

● Demo (2 min): http://genome.ulaval.ca:10208/client/
● Genomic DNA from stool of a male
● http://sra.dnanexus.com/samples/SRS011134

62

Futures

● Genomics needs more scalable & parallel software
● More parallel
● More push-button
● Robustness
● K-mer-based (paper: realtime kmers)

64

Acknowledgements

● Invitation: Nicholas J. Loman, University of Birmingham

● Arrangements: Lesley Parsons, University of Liverpool

65

Acknowledgements

● Funding: Canadian Institutes of Health Research (doctoral award)

● Compute time: Compute Canada & Calcul Québec (colosse and Mammouth Parallèle II)

● Jacques Corbeil (director) & François Laviolette (codirector)

66

Acknowledgements

● Jean-François Erdelyi (from France) for working on Ray Cloud Browser during the 2013 summer

67

Acknowledgements

● E. Godzaridis for comments and suggestions on my talk

68

Questions

● Don't forget to tweet!
● @sebhtml
● #BeatlesAndBioinformatics
