De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane – Mon 7 July 2014



DESCRIPTION

De novo assembly is the process of reconstructing a genome's DNA sequence using only a set of much shorter error‐prone sequences (reads) sampled from the genome. It is the "original" genomics‐based bioinformatics problem, because it is all we can do when we don't have any related reference genome sequences, with the exemplar being the original human genome project. This presentation will discuss the principles of and approaches to de novo assembly of data, and practical issues like computational and memory requirements, limitations of de novo assembly, terminology, file formats, available software, and an example run‐through of an assembly using the Velvet software if time permits. First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/


Page 1: Torsten Seemann - de novo genome assembly

De novo genome assembly

Dr Torsten Seemann

IMB Winter School - Brisbane – Mon 7 July 2014

Page 2: Torsten Seemann - de novo genome assembly

Introduction

Page 3: Torsten Seemann - de novo genome assembly

Ideal world

I would not need to give this talk!

AGTCTAGGATTCGCTACAGATTCAGGCTCTGAAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATAT

Human DNA iSequencer™

46 complete haplotype chromosome sequences

Page 4: Torsten Seemann - de novo genome assembly

Real world

•  Can’t sequence full-length native DNA
– no instrument exists (yet)

•  But we can sequence short fragments
– 100 at a time (Sanger)
– 100,000 at a time (Roche 454)
– 1,000,000 at a time (PGM)
– 10,000,000 at a time (Proton, MiSeq)
– 100,000,000 at a time (HiSeq)

Page 5: Torsten Seemann - de novo genome assembly

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

•  Instinctively like a jigsaw puzzle

– Find reads which “fit together” (overlap)
– Could be missing pieces (sequencing bias)
– Some pieces will be dirty (sequencing errors)

Page 6: Torsten Seemann - de novo genome assembly

An example

Page 7: Torsten Seemann - de novo genome assembly

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them tomorrow!

Page 8: Torsten Seemann - de novo genome assembly

Shakespearomics

•  Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

Oops! I dropped them.

Page 9: Torsten Seemann - de novo genome assembly

Shakespearomics

•  Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

•  Overlaps
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

I’m good with words.

Page 10: Torsten Seemann - de novo genome assembly

Shakespearomics

•  Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me

•  Overlaps
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     crymen, lend me
                             send me your ears;

•  Majority consensus
Friends, Romans, countrymen, lend me your ears;

We have a consensus!
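The lay-out-and-vote step can be sketched in a few lines of Python. This is only an illustration of the idea, not any real assembler's method: the offsets in `LAYOUT` are assumptions read off the overlap picture by eye, and `majority_consensus` is a hypothetical helper name.

```python
from collections import Counter

# (offset, read) pairs: where each read sits once the overlaps are found.
# The offsets are assumed here; a real assembler derives them from alignments.
LAYOUT = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),       # contains an error: "c" instead of "t"
    (29, "send me your ears;"),    # contains an error: "s" instead of "l"
]

def majority_consensus(layout):
    """Vote column by column; the most common character in each column wins."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, ch in enumerate(read):
            columns[offset + i][ch] += 1
    # "N" marks a column that no read covers.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

print(majority_consensus(LAYOUT))
```

Each of the two erroneous letters is covered by two correct reads, so it is outvoted 2 to 1. With lower coverage the vote could tie or go the wrong way.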

Page 11: Torsten Seemann - de novo genome assembly

So far, so good.

Page 12: Torsten Seemann - de novo genome assembly

The awful truth

“Genome assembly is impossible.”

A/Prof. Mihai Pop
World leader in de novo assembly research.

He wears glasses so he must be smart :-P

Page 13: Torsten Seemann - de novo genome assembly

Methods

Page 14: Torsten Seemann - de novo genome assembly

Approaches

•  greedy assembly
•  overlap :: layout :: consensus
•  de Bruijn graphs
•  string graphs
•  seed and extend

… all essentially doing the same thing, but taking different short cuts.
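As a rough illustration of the simplest approach in the list, greedy assembly, the sketch below repeatedly merges the two sequences with the longest exact suffix/prefix overlap. It assumes error-free reads and exact matching, and the names `overlap`, `greedy_assemble` and `READS` are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    seqs = list(reads)
    while len(seqs) > 1:
        # Score every ordered pair and keep the one with the longest overlap.
        n, i, j = max(
            (overlap(a, b, min_len), i, j)
            for i, a in enumerate(seqs)
            for j, b in enumerate(seqs)
            if i != j
        )
        if n == 0:
            break  # nothing overlaps any more: what is left are separate contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    return seqs

# Error-free versions of the Shakespeare reads (an assumption for this sketch).
READS = [
    "ds, Romans, count",
    "ns, countrymen, le",
    "Friends, Rom",
    "lend me your ears;",
    "trymen, lend me",
]
print(greedy_assemble(READS))
```

The short cut here is that the locally best merge is always taken. That works on this sentence, but on a genome with repeats a greedy merge can join the wrong copies.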

Page 15: Torsten Seemann - de novo genome assembly

Assembly recipe

•  Find all overlaps between reads
– hmm, sounds like a lot of work…

•  Build a graph
– a picture of read connections

•  Simplify the graph
– sequencing errors will mess it up a lot

•  Traverse the graph
– trace a sensible path to produce a consensus

Page 16: Torsten Seemann - de novo genome assembly

Clean graph

Page 17: Torsten Seemann - de novo genome assembly

Find read overlaps

•  If we have N reads of length L
– we have to do ½N(N-1) ~ O(N²) comparisons
– each comparison is an ~ O(L²) alignment
– use special tricks/heuristics to reduce these!

•  What counts as “overlapping”?
– minimum overlap length, e.g. 20 bp
– minimum % identity across overlap, e.g. 95%
– choice depends on L and expected error rate
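A minimal sketch of the two thresholds, under a simplifying assumption: the overlap is scored by position-by-position identity with no gaps, which is cheaper than the O(L²) alignment mentioned above and would miss overlaps containing indels. The function names and the test sequences are hypothetical.

```python
def best_overlap(a, b, min_len=20, min_identity=0.95):
    """Longest suffix-of-a / prefix-of-b overlap passing both thresholds, or None.

    Returns (overlap_length, identity). Gaps are not considered.
    """
    for n in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        matches = sum(x == y for x, y in zip(a[-n:], b[:n]))
        if matches / n >= min_identity:
            return n, matches / n
    return None

def n_comparisons(n_reads):
    """Number of unordered read pairs: N(N-1)/2."""
    return n_reads * (n_reads - 1) // 2

shared = "GATTACAGGCTTAACCGGTATCGAT"           # 25 bp present in both reads
read_a = "TTGACCA" + shared
read_b = shared[:10] + "A" + shared[11:] + "CCATGGA"   # one sequencing error in the overlap

print(best_overlap(read_a, read_b))   # 25 bp overlap, 24 of 25 positions identical
print(n_comparisons(1_000_000))       # pairs to test for a million reads
```

Raising `min_identity` to 1.0 would reject this overlap because of the single mismatch, which is why the choice has to track the expected error rate.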

Page 18: Torsten Seemann - de novo genome assembly

What we are up against!

Page 19: Torsten Seemann - de novo genome assembly

What ruins the graph?

•  Read errors
– introduce false edges and nodes

•  Non-haploid organisms
– heterozygosity causes lots of detours

•  Repeats
– if longer than read length
– causes nodes to be shared, locality confusion
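To make the first point concrete, here is a small sketch using a de Bruijn graph, one of the graph types listed earlier; that choice, the toy `GENOME` and the value of `K` are assumptions for illustration. Nodes are (k-1)-mers and every distinct k-mer in the reads contributes one edge.

```python
def de_bruijn_edges(reads, k):
    """Set of (prefix, suffix) edges between (k-1)-mers, one per distinct k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

K = 4
GENOME = "ATGGCGTGCAATCCGTA"                  # made-up sequence
bad_read = GENOME[:8] + "T" + GENOME[9:]      # same sequence with one substitution (C -> T)

clean = de_bruijn_edges([GENOME], K)
noisy = de_bruijn_edges([GENOME, bad_read], K)
false_edges = noisy - clean

print(len(clean), len(noisy))
print(sorted(false_edges))
```

A single wrong base can create up to k false edges, because every k-mer that covers the error is new to the graph.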

Page 20: Torsten Seemann - de novo genome assembly

Graph simplification

•  Squash small bubbles
– collapse small errors (or minor heterozygosity)

•  Remove spurs
– short “dead end” hairs on the graph

•  Join unambiguously connected nodes
– reliable stretches of unique DNA
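Spur removal can be sketched as follows, assuming the graph is a plain dictionary mapping each node to its set of successors. The function name and the `max_len` cut-off are hypothetical. Real assemblers also weigh read coverage before clipping, which this sketch ignores, and it only handles dead ends at the end of a path, not dangling starts.

```python
def remove_spurs(graph, max_len=2):
    """Return a copy of graph without short dead-end branches.

    graph: dict mapping node -> set of successor nodes.
    A spur is a dead-end chain of at most max_len nodes hanging off a fork.
    """
    g = {u: set(vs) for u, vs in graph.items()}
    for vs in list(g.values()):
        for v in vs:
            g.setdefault(v, set())          # make sure every node has an entry

    preds = {u: set() for u in g}
    for u, vs in g.items():
        for v in vs:
            preds[v].add(u)

    for tip in [u for u in g if not g[u]]:  # every dead end is a candidate
        path = [tip]
        while len(path) <= max_len:
            parents = preds[path[-1]]
            if len(parents) != 1:
                break                       # start of the graph, or a merge point
            (parent,) = parents
            if len(g[parent]) > 1:          # reached a fork: the chain is a spur
                for node in path:
                    for q in preds[node]:
                        g[q].discard(node)
                    del g[node]
                break
            path.append(parent)
    return g

GRAPH = {"A": {"B"}, "B": {"C"}, "C": {"D", "X"}, "D": {"E"}, "E": {"F"}}
print(remove_spurs(GRAPH))
```

In this toy graph the one-node hair `X` is clipped, while the longer branch `D`, `E`, `F` survives because it exceeds `max_len`. A genuine contig end shorter than the cut-off would be clipped too, which is why the threshold matters.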

Page 21: Torsten Seemann - de novo genome assembly

Graph traversal

•  For each disconnected subgraph (connected component)
– at least one per replicon in original sample

•  Find a path which visits each node once
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points

•  Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig
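The "paths which terminate at decision points" idea can be sketched without solving the Hamiltonian path problem. The example below swaps in a de Bruijn graph built from error-free k-mers (an assumption; the slide describes an overlap graph) and reports every maximal non-branching path as a contig. `GENOME`, `K` and the function name are made up for the example.

```python
from collections import defaultdict

def contigs_from_kmers(kmers):
    """Spell out every maximal non-branching path of the k-mer graph."""
    succ = defaultdict(set)     # (k-1)-mer -> set of following (k-1)-mers
    indeg = defaultdict(int)    # (k-1)-mer -> number of distinct incoming edges
    for kmer in set(kmers):
        succ[kmer[:-1]].add(kmer[1:])
        indeg[kmer[1:]] += 1
    nodes = set(succ) | set(indeg)

    def passes_through(n):
        # Exactly one way in and one way out: no decision to make here.
        return indeg[n] == 1 and len(succ[n]) == 1

    contigs = []
    for start in sorted(nodes):
        if passes_through(start):
            continue                        # contigs only start at decision points or ends
        for nxt in sorted(succ[start]):
            path = start + nxt[-1]
            while passes_through(nxt):
                (nxt,) = succ[nxt]
                path += nxt[-1]
            contigs.append(path)
    return contigs                          # isolated cycles are not reported in this sketch

K = 4
GENOME = "ACGTTGCAGTTGCCA"                  # contains the repeat TTGC twice
kmers = [GENOME[i:i + K] for i in range(len(GENOME) - K + 1)]
CONTIGS = contigs_from_kmers(kmers)
print(CONTIGS)
```

The repeated stretch comes out once, as its own contig (`GTTGC`), and the unique sequence on either side of it ends where the graph forks. Neighbouring contigs share k-1 bases at the junctions.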

Page 22: Torsten Seemann - de novo genome assembly

Contigs

Contiguous, unambiguous stretches of assembled DNA sequence

•  Contig ends correspond to
– Real ends (for linear DNA molecules)
– Dead ends (missing sequence)
– Decision points (forks in the road)

Page 23: Torsten Seemann - de novo genome assembly

Repeats

Page 24: Torsten Seemann - de novo genome assembly

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

•  Very common
– Transposons (self replicating genes)
– Satellites (repetitive adjacent patterns)
– Gene duplications (paralogs)

Page 25: Torsten Seemann - de novo genome assembly

Effect on assembly

The repeated element is collapsed into a single contig

Page 26: Torsten Seemann - de novo genome assembly

Repeat mis-assembly

[Figure: three kinds of repeat mis-assembly: collapsed tandem, excision, rearrangement]

Page 27: Torsten Seemann - de novo genome assembly

The law of repeats

•  It is impossible to resolve repeats of length S unless you have reads longer than S.

•  It is impossible to resolve repeats of length S unless you have reads longer than S.
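A small experiment illustrates why. The two made-up genomes below contain the same repeat `R` three times and differ only in the order of the unique segments between the copies. With perfect, error-free coverage (an assumption), reads no longer than the repeat give exactly the same collection of reads for both genomes, so no assembler could tell them apart. Reads long enough to span a whole repeat copy plus a base on each side do differ.

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Multiset of every error-free read of length read_len (one starting at each position)."""
    return Counter(genome[i:i + read_len] for i in range(len(genome) - read_len + 1))

R = "GATTACA"                                   # the repeat, length S = 7
A, B, C, D = "CCGT", "TTGG", "AACC", "GGTA"     # hypothetical unique segments
genome_1 = A + R + B + R + C + R + D
genome_2 = A + R + C + R + B + R + D            # B and C swapped

S = len(R)
same_short = read_spectrum(genome_1, S) == read_spectrum(genome_2, S)
same_long = read_spectrum(genome_1, S + 2) == read_spectrum(genome_2, S + 2)

print(genome_1 != genome_2, same_short, same_long)
```

Here `same_short` is true and `same_long` is false: only the longer reads carry any evidence about which unique segment follows which copy of the repeat.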

Page 28: Torsten Seemann - de novo genome assembly

Scaffolding

Page 29: Torsten Seemann - de novo genome assembly

Beyond contigs

Contig sizes are limited by:

•  the length of repeats in your genome
– can’t change this!

•  the length (or “span”) of the reads
– wait for new technology
– use “tricks” with existing technology

Page 30: Torsten Seemann - de novo genome assembly

Paired reads

•  DNA fragment (200-800 bp)
==============================

•  Single end
-------->=====================

•  Paired end (up to 800 bp span)
----->==================<-----

•  Mate pair (up to 20 kbp span)
---->========/+/=========<----

Page 31: Torsten Seemann - de novo genome assembly

Scaffolding

•  Paired-end reads
– known sequences at either end
– roughly known distance between ends
– unknown sequence between ends

•  Most ends will occur in same contig
– if our contigs are longer than pair distance

•  Some ends will be in different contigs
– evidence that these contigs are linked!
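The linking evidence can be tallied with a simple count. The sketch below assumes each read pair has already been mapped and reduced to the names of the two contigs its ends landed in. It ignores orientation and distance, which a real scaffolder would also use, and `scaffold_links`, `min_support` and the contig names are hypothetical.

```python
from collections import Counter

def scaffold_links(pair_hits, min_support=2):
    """Count read pairs whose two ends map to different contigs.

    pair_hits: iterable of (contig_of_end_1, contig_of_end_2).
    Returns {(contig_x, contig_y): n_pairs} for links seen at least min_support times.
    """
    links = Counter()
    for c1, c2 in pair_hits:
        if c1 != c2:
            links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_support}

PAIR_HITS = [
    ("contig1", "contig1"),   # both ends in the same contig: no linking information
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig3"),   # a single pair: too weak to trust
    ("contig3", "contig3"),
]
print(scaffold_links(PAIR_HITS))
```

Requiring more than one supporting pair is a guard against chimeric or mis-mapped pairs creating false joins.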

Page 32: Torsten Seemann - de novo genome assembly

Contigs to scaffolds

[Figure: contigs joined into a scaffold by paired-end reads, with gaps between the contigs]

Page 33: Torsten Seemann - de novo genome assembly

Assessment

Page 34: Torsten Seemann - de novo genome assembly

Assessing assemblies

•  We desire
– Total length similar to genome size
– Fewer, larger contigs
– No mistakes (mis-assemblies)

•  Metrics
– No generally useful objective measure
– Longest contig, total bp, N50, …

Page 35: Torsten Seemann - de novo genome assembly

The “N50”

The contig length L such that contigs of length L or longer contain at least 50% of all assembled bases

•  Imagine we got 7 contigs with lengths:
– 1,1,3,5,8,12,20

•  Total
– 1+1+3+5+8+12+20 = 50

•  The “halfway sum” is 25
– Add up from the longest contig: 20 (< 25), 20+12 = 32 (≥ 25), so N50 is 12
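One common way to compute N50 is to sort the contigs from longest to shortest and accumulate lengths until half of the total is reached. A minimal sketch, with `n50` as a hypothetical function name:

```python
def n50(lengths):
    """Length of the contig at which the running total, longest first, reaches half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None  # empty input

print(n50([1, 1, 3, 5, 8, 12, 20]))
```

For the seven contigs above the running total goes 20, then 32, so the function returns 12.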

Page 36: Torsten Seemann - de novo genome assembly

N50 concerns

•  Optimizing for N50
– encourages mis-assemblies!

•  An aggressive assembler may over-join:
– 1,1,3,5,8,12,20 (previous)
– 1,1,3,5,20,20 (now)
– 1+1+3+5+20+20 = 50 (unchanged)

•  The “halfway sum” is still 25
– Add up from the longest contig: 20 (< 25), 20+20 = 40 (≥ 25), so N50 is 20 (was 12)

Page 37: Torsten Seemann - de novo genome assembly

Validation

•  Self consistency
– Align reads back to contigs
– Check for errors or discordant pairs

•  Second opinion
– Use two complementary sequencing methods
– Target troublesome areas for PCR
– Use a genome wide “optical map”
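The discordant-pair check can be sketched as below, assuming each read end has already been aligned and reduced to a contig name, a leftmost position and a strand. The `Hit` type, the 100 bp read length and the function name are assumptions for illustration. The 200-800 bp span is the paired-end fragment range quoted earlier in the talk.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    contig: str
    pos: int        # leftmost mapped coordinate of the read
    forward: bool   # True if the read maps to the forward strand

def is_discordant(a, b, read_len=100, min_span=200, max_span=800):
    """True if a paired-end read pair disagrees with the assembly."""
    if a.contig != b.contig:
        return True                               # ends in different contigs
    left, right = sorted((a, b), key=lambda h: h.pos)
    if not (left.forward and not right.forward):
        return True                               # expected the ends to face each other
    span = right.pos + read_len - left.pos        # outer distance covered by the pair
    return not (min_span <= span <= max_span)

good = is_discordant(Hit("c1", 1000, True), Hit("c1", 1350, False))
too_far = is_discordant(Hit("c1", 1000, True), Hit("c1", 9000, False))
same_strand = is_discordant(Hit("c1", 1000, True), Hit("c1", 1300, True))
print(good, too_far, same_strand)
```

A cluster of discordant pairs at one location hints at a mis-assembly there. Pairs that merely land in different contigs may instead be legitimate scaffolding links.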

Page 38: Torsten Seemann - de novo genome assembly

How can I play?

Page 39: Torsten Seemann - de novo genome assembly

Considerations

•  Size of genome
– bacteria, eukaryote, meta-genome

•  Hardware
– phone, laptop, desktop, server, cloud
– RAM is more limiting than CPU

•  Operating system
– Linux, Mac, Windows

•  Software budget
– commercial, free, open-source

Page 40: Torsten Seemann - de novo genome assembly

Recommendations

•  SPAdes
– Unix command-line (Mac, Linux)

•  VAGUE (Velvet)
– Unix GUI (Mac, Linux)

•  CLC Genomics Workbench
– Java GUI (Windows, Mac, Linux)
– Commercial product

Page 41: Torsten Seemann - de novo genome assembly

Online tutorial

•  The GVL
– Genomics Virtual Laboratory
– http://genome.edu.au

•  Protocols
– Microbial de novo assembly for Illumina data
– Written by Simon Gladman (VBC/LSCC)
– https://genome.edu.au/wiki/Protocols

Page 42: Torsten Seemann - de novo genome assembly

Contact

•  Email
– [email protected]

•  Blog
– TheGenomeFactory.blogspot.com

•  Web
– vicbioinformatics.com
– vlsci.org.au/lscc
– genome.edu.au
