Learning to love de Bruijn graphs Ben Woodcroft, Australian Centre for Ecogenomics (ACE) Winter School in Bioinformatics, 2015

Learning to love de Bruijn graphs

Ben Woodcroft,

Australian Centre for Ecogenomics (ACE)

Winter School in Bioinformatics, 2015

A slide from Torsten Seemann

K‐mers and assembly

• For next‐generation sequencing, comparing every read with every other read is impractical.
  – E.g. 10 million reads -> 10^7 × 10^7 = 10^14 read‐read comparisons. Slowww..

• K‐mers and de Bruijn graphs help make things tractable
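As a rough illustration of why k‐mers help, the sketch below splits reads into k‐mers and links each k‐mer's (k‐1)‐base prefix to its (k‐1)‐base suffix, which is one common way of building a de Bruijn graph. The function names and the toy reads are made up for this example, and real assemblers also handle reverse complements, coverage and sequencing errors, which are ignored here.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # made-up toy reads
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Building the graph is a single pass over the reads, so the work grows with the total number of bases rather than with the square of the number of reads.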

K‐mers and assembly

Forks

K‐mer too small

K‐mer too large

My favourite k‐mer size

My favourite k‐mer size

With a 100bp read, this can never happen with a k‐mer size of 51

Fewer tips, more bubbles

As read lengths get longer, assemblers must move from handling dead ends in the graph to handling bubbles.
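To make the two structures concrete, here is a minimal sketch that labels branches in a small directed graph: a branch that runs into a dead end after a few steps is reported as a tip, and two branches that leave the same node and reconverge are reported as a bubble. The function name, the `max_tip_len` threshold and the toy graph are assumptions for illustration only; real assemblers typically also weigh coverage when deciding what to remove.

```python
from collections import defaultdict


def find_tips_and_bubbles(graph, max_tip_len=2):
    """Return (tips, bubbles) for a graph given as {node: set(successors)}.

    tips    : (branch_node, dead_end_node) for short dead-end branches
    bubbles : (branch_node, merge_node) where two or more branches reconverge
    """
    indegree = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    def walk(node):
        # Follow an unbranched path and report where and how it stops.
        steps, seen = 1, {node}
        while True:
            if indegree[node] > 1:
                return node, "merge", steps
            succs = graph.get(node, set())
            if not succs:
                return node, "dead_end", steps
            if len(succs) > 1:
                return node, "branch", steps
            (node,) = succs
            if node in seen:
                return node, "cycle", steps
            seen.add(node)
            steps += 1

    tips, bubbles = [], []
    for branch, succs in graph.items():
        if len(succs) < 2:
            continue
        merges = defaultdict(int)
        for s in sorted(succs):
            end, kind, steps = walk(s)
            if kind == "dead_end" and steps <= max_tip_len:
                tips.append((branch, end))
            elif kind == "merge":
                merges[end] += 1
        bubbles.extend((branch, m) for m, n in merges.items() if n >= 2)
    return tips, bubbles


# Toy graph: B splits into C and X, which rejoin at D (a bubble);
# D splits into the main path E-F-G-H and a one-node dead end T (a tip).
toy = {
    "A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "X": {"D"},
    "D": {"E", "T"}, "E": {"F"}, "F": {"G"}, "G": {"H"},
}
tips, bubbles = find_tips_and_bubbles(toy)
print("tips:", tips)
print("bubbles:", bubbles)
```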

Tips and bubbles

Metagenome assembly

Me: “I know, why don’t I just assemble all my data together?”

Run assembly
Wait 4 days
Out of memory allocating 18.4 million terabytes of RAM.

Solutions to RAM issues

• Quality trimming
• Hard trimming
• Throwing away a proportion of reads randomly
• Sequencing something else
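Two of these options are simple enough to sketch directly. The snippet below shows hard trimming (cutting every read to a fixed length) and random subsampling on reads held as plain strings. The function names, the `keep_fraction` and `seed` parameters and the toy reads are assumptions for this example; with paired‐end data both mates of a pair would need to be kept or dropped together, which this sketch does not handle.

```python
import random


def hard_trim(reads, length):
    """Cut every read down to at most `length` bases."""
    return [read[:length] for read in reads]


def subsample(reads, keep_fraction, seed=0):
    """Keep each read independently with probability `keep_fraction`."""
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    return [read for read in reads if rng.random() < keep_fraction]


reads = ["ATGGCGTACC", "GCGTACCTGA", "TTGACCGTAA", "CCGTAAGGCT"]  # toy reads
print(hard_trim(reads, 6))
print(subsample(reads, keep_fraction=0.5))
```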

Lossy de Bruijn graphs

The number of k‐mers observed is vanishingly small relative to the total number of possible k‐mers

The human genome: ~3 Gbp = ~3×10^9 k‐mers

Total possible 51‐mers: 4^51 = ~10^30

0.00000000000000000002%

When making a list of k‐mers, counting extra ones probably has little effect on assembly.
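The scale of this gap can be checked with a few lines of arithmetic. The sketch below takes the ~3×10^9 figure from the slide as an upper bound on the number of distinct k‐mers in a human genome and compares it with the exact number of possible 51‐mers; the variable names are made up, and the result is only meant as an order‐of‐magnitude check.

```python
k = 51
possible = 4 ** k                 # every combination of A, C, G, T over k bases
observed_upper_bound = 3 * 10**9  # at most one new k-mer per genome position

fraction = observed_upper_bound / possible
print(f"possible {k}-mers:     {possible:.2e}")
print(f"fraction observed:    {fraction:.1e}")
print(f"as a percentage:      {fraction * 100:.1e} %")
```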

Bloom filters

A low memory k‐mer “store”

Is my k‐mer in these reads?

From a Bloom filter, the answer is either “no” or “probably”
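The “no” or “probably” behaviour falls out of how the structure works: each item sets a handful of bits chosen by hash functions, and a query checks whether all of those bits are set. The sketch below is a minimal, unoptimised version; the class, its default sizes and the use of SHA‐256 as the hash are choices made for this example, not taken from any particular assembler.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All bits set -> "probably"; any bit clear -> definitely "no".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


reads = ["ATGGCGTACC", "GCGTACCTGA"]  # made-up toy reads
k = 5
bf = BloomFilter()
for read in reads:
    for i in range(len(read) - k + 1):
        bf.add(read[i:i + k])

print("ATGGC" in bf)  # a k-mer that was added, so the answer is "probably"
print("TTTTT" in bf)  # never added; usually "no", but a false positive is possible
```

Memory use is fixed by `num_bits` regardless of how many k‐mers are added; adding more k‐mers only raises the false‐positive rate.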

A finishing approach to assembly

A central assumption of this method is that the genome is “mostly” complete

Scaffolding without mate pair data

Gap filling vs. assembly

• Regular assembly ain’t easy
• Re‐assembly is more straightforward because you are trying to get somewhere specific

Gap filling can correct assembly errors

• Contigs often contain errors right at their ends

• By starting the search a little way back from the end of the contig (e.g. 200bp), these errors can be overcome
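The idea can be sketched as a path search in a de Bruijn graph built from the reads: start from a k‐mer some distance back from the end of one contig and search for a route to the start of the next. The code below is a toy illustration under that assumption and is not how finishm is implemented; `fill_gap`, `step_back`, `max_steps` and the sequences are all invented for the example, and `step_back` is 2 here rather than 200 because the toy contigs are only a few bases long.

```python
from collections import defaultdict, deque


def build_graph(reads, k):
    """de Bruijn graph: (k-1)-mer node -> set of following (k-1)-mer nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def fill_gap(graph, left_contig, right_contig, k, step_back=0, max_steps=1000):
    """Breadth-first search from near the end of left_contig to the start of
    right_contig. Returns the joined sequence, or None if no path is found."""
    n = k - 1
    start_pos = len(left_contig) - n - step_back
    start = left_contig[start_pos:start_pos + n]
    target = right_contig[:n]
    queue = deque([(start, start)])
    seen = {start}
    while queue:
        node, seq = queue.popleft()
        if node == target:
            # Bases after start_pos in the left contig are replaced by the path.
            return left_contig[:start_pos] + seq + right_contig[n:]
        if len(seq) - n >= max_steps:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None


reads = ["ACGTTGCA", "TTGCAATC", "CAATCGGA"]  # toy reads from ACGTTGCAATCGGA
graph = build_graph(reads, k=4)
left, right = "ACGTTC", "TCGGA"  # the last base of `left` is a deliberate error

print(fill_gap(graph, left, right, k=4))               # search from the very end
print(fill_gap(graph, left, right, k=4, step_back=2))  # search from 2 bases back
```

Starting at the very end fails because the erroneous final k‐mer is not in the graph, while stepping back lets the search route around the error and overwrite it with the sequence supported by the reads.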

Gap‐filling can account for strain variation

github.com/wwood/finishm

Thanks!

• Slideshare.com/benjwoodcroft

• Github.com/wwood

• Ecogenomic.org