
Denovo assembly.Why? Denovo assembly.Why? · seqahead_test.pptx Author: Hendrik-Jan Megens Created Date: 12/13/2012 10:22:19 PM


De novo assembly of NGS data

Sandra Smit, WUR Bioinformatics

PGSC

De novo assembly. Why?

454, Illumina, SOLiD, PacBio

• Read lengths are much shorter than even the smallest genome
• Not all reads map to the reference
• No reference genome available


Assemble this!

quickb  ckbrow  sove  helaz  ownfo  heq  quic
umpso  thequ  bro  nfox  kbr  oxjum  erthel  ver
the  azy  ckbro  lazydo  thel
ydo  umps  azyd  fox  nfo
thelaz  uickb  qui  rthela  ydog

Find overlap

Fragments whose ends share letters overlap, for example thequ, heq, qui, quic and quickb.

Layout of the reads

the thequ heq qui quic quickb uickb ckbro ckbrow kbr bro ownfo nfo nfox fox oxjum umps umpso sove ver erthel rthela thel thelaz helaz lazydo azy azyd ydo ydog

thequickbrownfoxjumpsoverthelazydog
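The overlap and layout steps of the toy example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `overlap` and `merge_in_order` are made up for this sketch, the reads are a hand-picked subset of the fragments above that are already in layout order, and reads contained inside other reads are left out.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads):
    """Merge reads that are already in layout order into one consensus string."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap(contig, read)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        # Append only the part of the read that extends the contig.
        contig += read[size:]
    return contig


# Subset of the fragments above, chosen so that each read extends the previous one.
LAYOUT = ["thequ", "quickb", "ckbrow", "ownfo", "nfox", "oxjum",
          "umpso", "sove", "erthel", "thelaz", "lazydo", "ydog"]

print(merge_in_order(LAYOUT))  # thequickbrownfoxjumpsoverthelazydog
```

Real assemblers have to discover the order themselves and tolerate mismatches; here the order is given, so only the overlap lengths are computed.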

Overlap-layout-consensus approach

• All-vs-all sequence similarity comparison
• Alignment of sequences with high sequence identity
• Determination of consensus sequence

ACGTCGTGTAGTGTGTATATATAAAGGCGTATAACGGA  

TATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATTG  

GATGCATCGATGTGTGTATATATAAAGGCCC  

ACGTCGTGTATCGTATATCGCGATCGGCATCGATCTT  

AGCTCATCTATTCAGCGACGAGACTCGTTATATCATCATATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA  


Greedy extension

Greedily join reads that are most similar to each other
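A minimal sketch of greedy extension, assuming exact suffix-prefix matches as the similarity measure (real assemblers use alignment scores and handle sequencing errors). The names `overlap` and `greedy_assemble` and the three example reads are invented for illustration.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=2):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size < min_overlap:
            break  # nothing left that overlaps well enough
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))  # ['ACGTACGGATTC']
```

Because each step commits to the locally best merge, repeats can lead this strategy to join reads that do not belong together.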

Graph-based OLC

Find Hamiltonian path

The de Bruijn graph approach

• Solving the assembly problem with de Bruijn graphs
  – Nodes are k-mers
  – Edges are overlaps between k-mers on k-1 positions
• A k-mer is a (short) substring of a read
  – A read of n bp consists of (n - k + 1) k-mers
  – Example: a read of 75 bp consists of 45 (overlapping) 31-mers

TGACCAGTG → TGAC, GACC, ACCA, CCAG, ...

Generating k-mers from reads

Read (21 bp), k = 11:

ACATTGCTATGCTACAAATGA

ACATTGCTATG
 CATTGCTATGC
  ATTGCTATGCT
   TTGCTATGCTA
    TGCTATGCTAC
     GCTATGCTACA
      CTATGCTACAA
       TATGCTACAAA
        ATGCTACAAAT
         TGCTACAAATG
          GCTACAAATGA
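The k-mer counts on this slide can be checked with a short sketch. The helper name `kmers` is an assumption made for this example; the reads and values of k are the ones shown above.

```python
def kmers(read, k):
    """Return all overlapping k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# A read of n bp yields n - k + 1 k-mers.
print(len(kmers("A" * 75, 31)))   # 45
print(kmers("TGACCAGTG", 4))      # ['TGAC', 'GACC', 'ACCA', 'CCAG', 'CAGT', 'AGTG']

read = "ACATTGCTATGCTACAAATGA"    # the 21 bp read from the slide
print(len(kmers(read, 11)))       # 11
```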


Graph-based assembly of k-mers

[Figure: k-mer table containing ACATTGCTATG, CATTGCTATGC, ATTGCTATGCT, ATTGCTATGCG and GACATTGCTAT; the sequence is extended one base at a time]

Find Eulerian path

K-mer extraction and graph creation happen simultaneously

Error correction is an important step
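A rough sketch of graph-based assembly of k-mers, following the definition used in these slides (nodes are k-mers, edges connect k-mers that overlap on k-1 positions). It is a simplification: instead of a general Eulerian path search it only follows the path while there is exactly one way forward. The names `build_graph` and `walk_unambiguous` and the two example reads are assumptions for this illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Map every k-mer to the k-mers that overlap its last k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    by_prefix = defaultdict(list)
    for node in sorted(nodes):
        by_prefix[node[:-1]].append(node)
    return {node: by_prefix.get(node[1:], []) for node in nodes}


def walk_unambiguous(graph):
    """Spell a contig from a start node while each node has exactly one successor."""
    has_incoming = {succ for succs in graph.values() for succ in succs}
    starts = sorted(node for node in graph if node not in has_incoming)
    if not starts:
        return ""  # no node without a predecessor (e.g. a cycle)
    node = starts[0]
    contig, seen = node, {node}
    while len(graph[node]) == 1 and graph[node][0] not in seen:
        node = graph[node][0]
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


graph = build_graph(["TGACCAG", "ACCAGTG"], k=4)
print(walk_unambiguous(graph))  # TGACCAGTG
```

The walk stops at the first branch, which is where repeats, errors and heterozygosity make the real problem hard.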

Strategies & Assemblers

Zhang et al. 2011

Assemble

Just push the button?


Complicating factor: DNA structure

• DNA is double stranded
• DNA contains palindromes etc.
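Because DNA is double stranded, a read can come from either strand, so assemblers have to treat a sequence and its reverse complement as the same locus. The sketch below shows one common way to handle this (picking a canonical form per k-mer); the function names are illustrative and the choice of the lexicographically smaller strand is an assumption, not something stated on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """A DNA palindrome equals its own reverse complement."""
    return seq == reverse_complement(seq)


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("ACGTT"))  # AACGT
print(is_palindrome("GAATTC"))      # True
print(canonical("TTGCA"))           # TGCAA
```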

Complicating factor: repeats

[Figure labels: Repeat, Gene, EST, Genetic, FISH]

Tomato has a large, repetitive genome
- 200-250 Mb gene-rich euchromatin
- 600-700 Mb highly repetitive heterochromatin

- Multiple overlaps in overlap-layout-consensus approach
- Multiply connected nodes in graph-based approach

Complicating factor: sequencing errors

Technologies have different properties and error profiles
- Sequence-dependent coverage bias
- Nonuniform error rates

Kircher & Kelso, 2010

Filter your data!

[Figure: k-mer table with a number next to each k-mer; k-mers shown: ACATTGCTATG, CATTGCTATGC, ATTGCTATGCT, ATTGCTATGCG, ATTGCTTGCGT, GACATTGCTAT; numbers shown: 12, 8, 118, 1, n, 23, 6]

Data filtering is essential to remove:
- Sequencing errors
- Homopolymer stretches
- Low-quality bases

Use for example a k-mer table
- assumption: k-mer table represents the complete genome
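One way a k-mer table can be used for filtering is to count every k-mer across all reads and flag the rare ones, since a k-mer seen only once is more likely to contain a sequencing error than to be real genome sequence. The sketch below is a hedged illustration: the function names, the threshold `min_count` and the four example reads (three identical reads plus one with a single base change) are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build a k-mer table: k-mer -> number of occurrences over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_count(counts, min_count):
    """Separate well-supported ("solid") k-mers from rare ("weak") ones."""
    solid = {kmer for kmer, n in counts.items() if n >= min_count}
    weak = set(counts) - solid
    return solid, weak


reads = ["ACATTGCTATG"] * 3 + ["ACATTGCTTTG"]  # last read carries one error
table = count_kmers(reads, k=5)
solid, weak = split_by_count(table, min_count=2)
print(sorted(weak))  # ['CTTTG', 'GCTTT', 'TGCTT']
```

A sensible threshold depends on sequencing depth; with low coverage, genuine k-mers can also be rare.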


Complicating factor: ploidy

• Ploidy is the number of sets of chromosomes in a biological cell
• Diploid: two sets of chromosomes
  – Human cells
• Polyploidy means more than two sets of chromosomes per nucleus
• Tetraploid: four sets of chromosomes
  – Common in plants

Complicating factor: heterozygosity

[Figure: k-mers, homozygous potato (k = 31); y-axis from 0.00E+00 to 1.00E+09]

[Figure: k-mers, heterozygous potato (k = 31); y-axis from 0.00E+00 to 1.00E+09; regions labelled "homozygous k-mers" and "heterozygous k-mers"]

[Figure: k = 63; y-axis from 0.00E+00 to 1.00E+09]
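Plots like these are commonly k-mer multiplicity histograms: for each count n, how many distinct k-mers occur exactly n times. That reading is an assumption here, since the axis labels did not survive. Under that assumption, such a histogram could be computed as follows; `kmer_spectrum` and the example reads are invented for illustration.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return a histogram: multiplicity -> number of distinct k-mers with that count."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


reads = ["ACATTGCTATG"] * 3 + ["ACATTGCTTTG"]
spectrum = kmer_spectrum(reads, k=5)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a heterozygous genome, k-mers covering a variant are seen on only one haplotype, so they would be expected to form an extra peak at roughly half the coverage of the homozygous peak.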


Complicating factor: heterozygosity

• Heterozygous k-mers create "bubbles" in the graph
  – Every SNP and in/del creates a bubble
• Solutions
  – Pinching bubbles: loss of diversity
  – Breaking graph: loss of contiguity
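To see why a single SNP creates a bubble, compare the k-mers of two haplotypes that differ at one position: the k-mers covering the SNP differ between the alleles and form the two branches, while the shared k-mers on either side form the common path. The sketch below uses two made-up 11 bp haplotypes and an illustrative helper name.

```python
def kmer_set(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bubble_branches(hap_a, hap_b, k):
    """Return the k-mers unique to each haplotype and the k-mers they share."""
    a, b = kmer_set(hap_a, k), kmer_set(hap_b, k)
    return a - b, b - a, a & b


hap_a = "ACGTAGCTTGA"
hap_b = "ACGTACCTTGA"   # same sequence with one SNP (G -> C at position 6)
only_a, only_b, shared = bubble_branches(hap_a, hap_b, k=4)

print(sorted(only_a))   # ['AGCT', 'GCTT', 'GTAG', 'TAGC']
print(sorted(only_b))   # ['ACCT', 'CCTT', 'GTAC', 'TACC']
print(sorted(shared))   # ['ACGT', 'CGTA', 'CTTG', 'TTGA']
```

In this example each branch contains k k-mers. Pinching the bubble keeps one branch and discards the other allele; breaking the graph at the bubble keeps both but splits the contig.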

Assembly is not so easy

Assembly result

[Figure labels: DNA molecule, shotgun reads, contigs, paired-end reads, scaffolds]

ACGTCGTGTATCGTATATCGCGATCGGCATCGATCTATTCAGCGACGAGACTCGTATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA  

ACGTCGTGTATCGTATATCGCGATCGGCA------TCTATTCAGCGACGAGACTCGTATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA

Assembly quality

• Number and length of contigs/scaffolds
  – N50
• Number of gaps/Ns
• GC content
• Coverage and completeness
• Structural consistency
  – Paired-end data
  – Physical and genetic maps
• Error rate
  – Known sequences (BAC clones, ESTs, etc.)

[Figure: contigs sorted by length, with 50% of the assembly marked; N50 length: 18 Kb, N50 index: 4]
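Two of the simpler quality metrics in this list, GC content and the number of gaps/Ns, can be computed directly from a scaffold sequence. The sketch assumes gaps are written as runs of N, and the function names and the short test sequence are made up for illustration.

```python
import re


def gc_content(sequence):
    """Fraction of G and C among the called bases (Ns are ignored)."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def gap_stats(sequence):
    """Return (number of gaps, total number of N bases) for one scaffold."""
    runs = re.findall(r"N+", sequence.upper())
    return len(runs), sum(len(run) for run in runs)


scaffold = "ACGTNNGCNA"
print(round(gc_content(scaffold), 3))
print(gap_stats(scaffold))
```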


Basic stats: exercise

• Calculate
  – N50 contig size and index
  – N90 contig size and index

Contigs (length): 250, 100, 120, 150, 20, 80, 80, 90, 70, 40

Basic stats: exercise

• Total assembly size = 1000
• Sort from large to small: 250, 150, 120, 100, 90, 80, 80, 70, 40, 20
• Cumulative length: 250, 400, 520, 620, 710, 790, 870, 940, 980, 1000

• N50 lookup value = 500
• N50 contig size = 120
• N50 contig index = 3

• N90 lookup value = 900
• N90 contig size = 70
• N90 contig index = 8
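The exercise can be checked with a small function that sorts the contigs from large to small and walks down the list until the cumulative length reaches the lookup value. The function name `nx` is an assumption for this sketch; the contig lengths are those of the exercise.

```python
def nx(lengths, fraction):
    """Return (contig size, contig index) at which `fraction` of the assembly is reached."""
    ordered = sorted(lengths, reverse=True)
    lookup = sum(ordered) * fraction
    running = 0
    for index, length in enumerate(ordered, start=1):
        running += length
        if running >= lookup:
            return length, index
    raise ValueError("lengths must not be empty")


CONTIGS = [250, 100, 120, 150, 20, 80, 80, 90, 70, 40]

print(nx(CONTIGS, 0.5))  # (120, 3)
print(nx(CONTIGS, 0.9))  # (70, 8)
```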

Genome finishing

• Closing gaps
  – IMAGE pipeline, Tsai et al. 2010
  – GapFiller, Boetzer and Pirovano 2012
• Establish order and orientation of contigs/scaffolds
  – E.g. FISH experiments
• Base error correction
  – k-mer correction
• Contamination removal

From initial draft to complete genome

Chain et al., Science 2009: Genome Project Standards in a New Era of Sequencing

Gap closure

[Figure: Contig A, gap, Contig B; new contigs; merged contig]

1. Align the paired-end reads onto the draft sequence
2. Local assembly of the aligned reads; new contigs are produced
3. Gaps are now shortened. Repeat the whole procedure for a few iterations (new reads can be aligned in the presence of the extended reference)
4. The gap is now closed

IMAGE pipeline, Tsai et al. 2010


Sequencing strategy

Kircher & Kelso, 2010

Sequencing strategy: which sequencing data is necessary for my project? How can we balance the benefits and the costs?

• Technologies have different properties and error profiles
• What is the contribution of different input data to an assembly?

Sequencing strategy: recommendations

• Longer reads (Sanger, 454) improve assembly quality
• Deep coverage by reads with lengths longer than common repeats
• Mixture of short and long insert sizes for paired-end data


Example

• De novo assembly: overlap-layout-extend
  – 454 & Sanger shotgun sequences
  – 454 & Sanger matepair sequences
• Base error correction with SOLiD reads
• Base error correction with Illumina reads
• Gap filling with Sanger BAC sequences

Summary: assembly workflow

1. NGS sequencing/data production
2. Data cleaning/filtering/error correction
3. De novo assembly into contigs and (super)scaffolds
4. Assembly finishing/gap filling
5. Assembly validation

Literature

• Schatz et al. Assembly of large genomes using second-generation sequencing. Genome Res. 2010
• Kircher and Kelso. High-throughput DNA sequencing -- concepts and limitations. Bioessays 2010
• Miller et al. Assembly algorithms for next-generation sequencing data. Genomics 2010

Acknowledgements

• Jack Leunissen
• Roeland van Ham
• Erwin Datema
• Jan van Haarst

– The International Tomato Genome Sequencing Consortium
– Potato Genome Sequencing Consortium