Learning to love de Bruijn graphs
Ben Woodcroft,
Australian Centre for Ecogenomics (ACE)
Winter School in Bioinformatics, 2015
A slide from Torsten Seemann
K‐mers and assembly
• For next‐generation sequencing, comparison of each read with each other read is impossible.
– E.g. 10 million reads ‐> 10^7 x 10^7 read‐read comparisons. Slowww..
• K‐mers and de Bruijn graphs help make things tractable
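A minimal sketch of the idea, not taken from the slides: instead of comparing reads pairwise, each read is cut into k‐mers and every k‐mer becomes an edge between its (k‐1)‐mer prefix and suffix. The function names, the toy reads and k = 4 are illustrative assumptions, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads (toy data, far shorter than real reads).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn(reads, k=4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Each read is visited once, so the work grows with the number of k‐mers rather than with the number of read pairs. Overlapping reads share k‐mers and therefore land on the same edges without ever being compared to each other directly.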
K‐mers and assembly
Forks
K‐mer too small
K‐mer too large
My favourite k‐mer size
With a 100bp read, this can never happen with a k‐mer size of 51
Fewer tips, more bubbles
As read lengths get longer, assemblers must move from handling dead ends in the graph to handling bubbles.
Tips and bubbles
Metagenome assembly
Me: “I know, why don’t I just assemble all my data together?”
Run assembly
Wait 4 days
Out of memory allocating 18.4 million terabytes of RAM.
Solutions to RAM issues
• Quality trimming
• Hard trimming
• Throwing away a proportion of reads randomly
• Sequencing something else
Lossy de Bruijn graphs
The number of k‐mers observed is vanishingly small relative to the total number of possible k‐mers
The human genome: ~3 Gbp = ~3×10^9 k‐mers
Total possible 51‐mers: 4^51 = ~5×10^30
~0.00000000000000000006% of possible 51‐mers (about 6×10^-20 %)
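The arithmetic behind this comparison can be checked directly. The sketch below assumes at most one distinct k‐mer per genome position and counts 4 possible bases at each of the 51 positions; the variable names are illustrative.

```python
k = 51
genome_kmers = 3 * 10**9   # upper bound: one k-mer starting at each genome position
possible_kmers = 4**k      # four possible bases at each of the k positions

fraction = genome_kmers / possible_kmers

print(f"possible {k}-mers: {possible_kmers:.2e}")
print(f"fraction observed: {fraction:.1e} ({fraction * 100:.1e} %)")
```

Python integers are arbitrary precision, so 4**51 is computed exactly before the division.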
When making a list of k‐mers, counting extra ones probably has little effect on assembly.
Bloom filters
A low memory k‐mer “store”
Is my k‐mer in these reads?
From a Bloom filter, the answer is either “no” or “probably”
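A minimal sketch of why the answer is “no” or “probably”, not the implementation used by any particular assembler: each k‐mer sets a few bits in a fixed‐size bit array, and a lookup checks whether all of those bits are set. The class name, the bit‐array size, the number of hashes and the use of SHA‐256 as the hash are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests can give false positives only."""

    def __init__(self, num_bits=10_000, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All bits set: "probably". Any bit unset: definitely "no".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical toy reads and k-mer size.
reads = ["ATGGCGT", "GGCGTGC"]
k = 4
kmer_filter = BloomFilter()
for read in reads:
    for i in range(len(read) - k + 1):
        kmer_filter.add(read[i:i + k])

print("GGCG" in kmer_filter)  # an added k-mer is always reported as present
print("TTTT" in kmer_filter)  # False means a definite "no"; True would only mean "probably"
```

Memory use is set by `num_bits` and does not grow as k‐mers are added, which is the attraction for large data sets. The cost is that a k‐mer never seen in the reads can occasionally be reported as present, and the false‐positive rate rises as the array fills.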
A finishing approach to assembly
A central assumption of this method is that the genome is “mostly” complete
Scaffolding without mate pair data
Gap filling vs. assembly
• Regular assembly ain’t easy
• Re‐assembly is more straightforward because you are trying to get to somewhere
Gap filling can correct assembly errors
• Contigs often contain errors right at their ends
• By starting the search a little way back from the end of the contig (e.g. 200 bp), these errors can be overcome
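The effect of stepping back from the contig ends can be sketched with a toy breadth‐first search through a k‐mer graph built from reads. This is an illustrative sketch only and is not FinishM's implementation: `fill_gap`, `build_graph`, the toy sequences, k = 5 and a step back of 2 bases (instead of ~200 bp) are all assumptions chosen to keep the example small.

```python
from collections import defaultdict, deque


def build_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def fill_gap(left, right, reads, k, step_back, max_steps=1000):
    """Join two contigs via a path through the read graph, or return None.

    The search starts step_back bases before the end of `left` and finishes
    step_back bases after the start of `right`, so the contig ends themselves
    are replaced by whatever the reads support.
    """
    n = k - 1
    if len(left) < step_back + n or len(right) < step_back + n:
        raise ValueError("contigs are too short for this step_back and k")
    graph = build_graph(reads, k)
    start = left[len(left) - step_back - n:len(left) - step_back]
    goal = right[step_back:step_back + n]

    queue = deque([(start, start)])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            # Keep the trusted contig interiors, replace the ends with the path.
            return left[:len(left) - step_back - n] + path + right[step_back + n:]
        if len(path) - n >= max_steps:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + nxt[-1]))
    return None


# Hypothetical true sequence, and two contigs with an error at the end facing the gap.
truth = "ACGGTCATTGCAGGATCCTAGA"
left_contig = "ACGGTCATA"    # last base should be T
right_contig = "GTCCTAGA"    # first base should be A
reads = ["GGTCATTGCAGG", "TGCAGGATCCTAG"]  # error-free reads spanning the gap

print(fill_gap(left_contig, right_contig, reads, k=5, step_back=0))
print(fill_gap(left_contig, right_contig, reads, k=5, step_back=2))
```

With `step_back=0` the search starts from a k‐mer containing the erroneous final base, which no read supports, so no path is found. With `step_back=2` both anchors sit in correct sequence, and the path through the reads replaces the erroneous contig ends as well as filling the gap.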
Gap‐filling can account for strain variation
github.com/wwood/finishm
Thanks!
• Slideshare.com/benjwoodcroft
• Github.com/wwood
• Ecogenomic.org