Assembly by short paired-end reads Wing-Kin Sung National University of Singapore

  • Slide 1
  • Assembly by short paired-end reads Wing-Kin Sung National University of Singapore
  • Slide 2
  • State of genome sequencing: Thousands of bacterial genomes plus a few dozen higher organisms have been sequenced. Many genomes are still waiting to be sequenced. Personalized sequencing is also a big market. In summary, we need cheaper and faster sequencing.
  • Slide 3
  • Bio-technology: DNA-PETs. What data is used for genome assembly? A DNA-PET is a paired-end tag extracted from the genome. Each tag has length readlength (e.g. readlength = 35). The span of the DNA-PET is fixed (e.g. 1kb, 5kb, 10kb, or 20kb). [Figure: two tags, ACTCAGCACCTTACGGCGTGCATCA and TACGTTCTGAACGGCAGTACAAACT, each of length readlength, separated by the span of the paired-end read]
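To make the readlength and span parameters concrete, here is a minimal sketch that draws one paired-end tag from a toy genome. The function name `sample_pet` and the random toy genome are illustrative assumptions, not part of the talk. For simplicity both tags are reported on the forward strand; real libraries usually report the second tag from the opposite strand.

```python
import random

def sample_pet(genome, span, readlength, rng):
    """Pick one fragment of length `span` and return (start, left tag, right tag).

    Assumption: both tags are taken from the forward strand and do not overlap.
    """
    if not 2 * readlength <= span <= len(genome):
        raise ValueError("need 2 * readlength <= span <= genome length")
    start = rng.randrange(len(genome) - span + 1)
    fragment = genome[start:start + span]
    # The two tags are the first and last `readlength` bases of the fragment.
    return start, fragment[:readlength], fragment[-readlength:]

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))  # toy genome
start, left_tag, right_tag = sample_pet(genome, span=1000, readlength=35, rng=rng)
```

The distance from the first base of `left_tag` to the last base of `right_tag` equals `span`, which is the information an assembler later uses to link distant parts of the genome.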
  • Slide 4
  • Bio-technology: DNA-PETs. [Figure: some genome -> sonication -> size selection -> paired-end sequencing]
  • Slide 5
  • Sequence Assembly Problem Given the paired-end reads, can we assemble them to reconstruct the genome?
  • Slide 6
  • Agenda: a short discussion of the data quality; a brief review of existing methods; PE-Assembler; an example demonstrating the use of assembly; scaffolding.
  • Slide 7
  • QUALITY OF PAIRED-END SEQUENCING
  • Slide 8
  • Paired-end sequencing. [Figure: size selection (1kb, 10kb, 20kb) -> circularize, ligate, and cut -> sequencing]
  • Slide 9
  • Size selection is not exact. [Figure: sample fragment length distributions for a 300bp paired-end library and a 10,000bp mate-pair library]
  • Slide 10
  • Errors in DNA Sequencing: ligation errors. They occur in mate-pair libraries during library construction: two unrelated reads are paired together. [Figure: the 5' and 3' ends of two different fragments, one from Chr1 and one from Chr2, put together]
  • Slide 11
  • Errors in DNA Sequencing: sequencing errors. They are caused by the sequencing machine misreading bases. In most sequencing technologies, sequencing errors are more likely to occur towards the end of the read. Actual DNA sequence: ACGTGAGGATGACACGATAGCCA. Sequence as interpreted by the machine: ACGTGAGCATGACACGATAGCCA. The machine incorrectly reads the eighth base (G) as a C.
  • Slide 12
  • EXISTING METHODS
  • Slide 13
  • SSAKE, VCAKE and SHARCGS: base-by-base 3' extension. Currently, these methods can assemble short genomes.
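The idea of base-by-base 3' extension can be sketched as follows. This is a simplified illustration in the spirit of these tools, not the actual SSAKE, VCAKE or SHARCGS algorithm: the function name `extend_3prime`, the fixed `overlap` length, and the rule "stop unless exactly one next base is supported" are all assumptions made for the example.

```python
from collections import Counter

def extend_3prime(seed, reads, overlap, max_len=10_000):
    """Greedily extend `seed` to the right, one base at a time.

    At each step, look for the last `overlap` bases of the contig inside the
    reads and collect the base that follows each occurrence. Extend only when
    exactly one distinct base is supported (assumed stopping rule).
    """
    contig = seed
    while len(contig) < max_len:
        suffix = contig[-overlap:]
        votes = Counter()
        for read in reads:
            pos = read.find(suffix)
            while pos != -1:
                nxt = pos + overlap
                if nxt < len(read):
                    votes[read[nxt]] += 1
                pos = read.find(suffix, pos + 1)
        if len(votes) != 1:  # dead end or ambiguous extension
            break
        contig += next(iter(votes))
    return contig

reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]
contig = extend_3prime("AAGACTC", reads, overlap=4)
print(contig)  # AAGACTCCGACT
```

With these four toy reads the extension stops after AAGACTCCGACT, because the suffix GACT is followed by three different bases in the reads. Repeats of this kind are one reason greedy extension is limited to short genomes.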
  • Slide 14
  • De Bruijn graph approach: Velvet, Euler-USR, ABySS, IDBA. E.g. input = {AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT}. List of 3-mers = {AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT}. [Figure: the reads AAGACTC, ACTCCGACTG, ACTGGGAC and GGACTTT aligned under the assembled sequence AAGACTCCGACTGGGACTTT] References: Mark J. Chaisson, Dumitru Brinza and Pavel A. Pevzner. De novo fragment assembly with short mate-paired reads: Does read length matter? Genome Res. 19: 336-346. 2009. Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18: 821-829. 2008.
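The 3-mer list for the example reads can be reproduced with a short sketch. It uses a simplified graph in which nodes are k-mers and an edge joins two k-mers that are consecutive in some read. The function names are illustrative, and real assemblers such as Velvet add error correction and graph simplification that are not shown here.

```python
from collections import defaultdict

def kmers(read, k):
    """All substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph[km]  # touching the key registers the node even without successors
        for u, v in zip(ks, ks[1:]):
            graph[u].add(v)
    return graph

reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]
graph = de_bruijn(reads, 3)
print(len(graph))           # 14 distinct 3-mers
print(sorted(graph["ACT"]))  # ['CTC', 'CTG', 'CTT']
```

The node ACT has three successors, which is the branching that the assembler must resolve, for example with paired-end information.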
  • Simulation data
    Organisms and assemblers: E. coli (PA, Velvet, Allpaths2, ABySS, SOAP); S. pombe (PA, Velvet, Allpaths2, ABySS, SOAP); HG18 chr10 (PA, ABySS, SOAP)
    Contig statistics
      No. of contigs (>200bp): 656442831993118116465034842624901518238
      Average length (kb): 777.482606107.622.322.8394.767.975.323.13530.22.96.6
      Maximum length (kb): 2492.6708.6593.71632323519.6856.1851297.3468.7403.565.2155.8
      Contig N50 size (kb): 2492.6398.3373.363.849.91487.7273226.880.199.862.45.313
      Contig N90 size (kb): 2146109.9115.433.912.4363.654.459.536.71911.11.73.6
      Coverage: 10.99590.99850.990.98870.97780.99350.9860.98380.98910.90890.92040.8714
    Evaluation
      Large misassemblies: 0110110170145171605
      Segment maps: 0.99680.94740.99180.93310.93660.96420.94440.96830.92720.94310.86170.63610.322
    Performance
      Total execution time (min): 2110227435101407349811748N/A240
      Peak memory usage (gb): 2.32.929.72.95.94.57.76668.115.1N/A48
    Velvet and Allpaths2 are not efficient enough to handle the HG18 chr10 dataset. The N50 length of an assembly is the largest length L such that contigs of length at least L account for 50% of the total assembly length. N90 is defined similarly. Segment map: divide the genome into bins of 1000bp and count the bins that are the same as in the reference genome.
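The N50 and N90 definitions above translate directly into a few lines of code. The sketch below is a generic implementation of that definition; the function name `nxx` and the toy contig lengths are illustrative and are not taken from the benchmark.

```python
def nxx(lengths, percent):
    """Largest length L such that contigs of length >= L cover `percent`% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")

contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy values, total 300
print(nxx(contig_lengths, 50))  # 70
print(nxx(contig_lengths, 90))  # 30
```

In the toy example the two longest contigs (80 + 70 = 150) already cover half of the 300 total, so N50 is 70. Five contigs are needed to reach 270, so N90 is 30.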
  • Slide 38
  • Experimental data: We obtained 4 real-life datasets from the Allpaths2 paper.
  • Slide 39
  • Experimental data
    Organisms and assemblers: S. aureus (PA, Velvet, Allpaths2, ABySS); E. coli (PA, Velvet, Allpaths2, ABySS)
    Contig statistics
      No. of contigs (>200bp): 2460141872112125277
      Average length (kb): 119.84820518.3176.837.5184.121.4
      Maximum length (kb): 949.9475.61122.8175.1895.9356.61015.3160.4
      Contig N50 size (kb): 685.8314.9477.263.8428.8105.6337.155.2
      Contig N90 size (kb): 107.537.798431.9143.125.481.731.8
      Coverage: 0.99450.98990.99240.98280.99560.99190.99630.9896
    Evaluation
      Large misassemblies: 05010401
      Segment maps: 0.98480.96660.98550.94560.98730.9560.99180.9455
    Performance
      Total execution time (min): 1789513342522229
      Peak memory usage (gb): 1.92.8202.63.36.937.65.3
    Organisms and assemblers: S. pombe (PA, Velvet, Allpaths2, ABySS); N. crassa (PA, Velvet, Allpaths2, ABySS)
    Contig statistics
      No. of contigs (>200bp): 16936235310282708507916879916
      Average length (kb): 72.133.733.81312.86.818.33.8
      Maximum length (kb): 571.1443257.2136.8156.271161.256
      Contig N50 size (kb): 147.7110.6503620.711.617.68.1
      Contig N90 size (kb): 4033.212.212.3---1
      Coverage: 0.96970.97820.9520.97930.8740.8770.78380.887
    Evaluation
      Large misassemblies: 3262271627318395
      Segment maps: 0.95510.94260.9260.91080.82060.77440.74660.7128
    Performance
      Total execution time (min): 36412548307214162665196331
      Peak memory usage (gb): 6.615N/A6.62145N/A25.6
  • Slide 40
  • Running time: single CPU, multiple cores
  • Slide 41
  • EXAMPLE APPLICATION
  • Slide 42
  • Burkholderia species. Burkholderia pseudomallei (Bp): causative agent of melioidosis, a serious infectious disease of humans and animals with an overall fatality rate of 50%. Burkholderia thailandensis (Bt): non-pathogenic to mammals. Why can Bp infect humans? Features specific to Bp are likely required for Bp to colonize and infect mammals. These include the gain of a Bp-specific capsular polysaccharide gene cluster. [Figure: wrinkled colonies; round colonies]
  • Slide 43
  • Bt E555. My collaborator Patrick Tan thinks that virulence versus non-virulence is not a black-and-white issue and that there should be some intermediate state. He examined 28 Bt strains and found Bt E555, which forms a mixture of smooth and wrinkled colonies. [Figure: mixture of smooth and wrinkled colonies]
  • Slide 44
  • Sequencing of Burkholderia thailandensis (Bt E555). We sequenced Bt E555 using the Solexa Genome Analyser II: 12.5M paired-end reads; each read is 100bp long; insert size is 130-290bp. We mapped the sequences onto the Bt reference E264.
  • Slide 45
  • De novo assembly of Bt E555 using PE-Assembler. 521 contigs; N50: 20293 bp; total length: 6145909 bp; longest contig: 72827 bp; shortest contig: 250 bp. In particular, contig 19 (24k bp) is similar to the Bp-like CPS in Burkholderia pseudomallei. It replaces EPS.
  • Slide 46
  • Phenotype of Bp-like CPS. Bp colonies are wrinkled. Bt colonies are round and smooth. Bt E555 exhibited a mixture of smooth and wrinkled colonies. Bt E555 CPS knockouts (KO) develop round colonies with no wrinkling. This suggests that Bp-like CPS expression may contribute to the wrinkled colonies. [Figure: wrinkled colonies; mixture of smooth and wrinkled colonies; round colonies]
  • Slide 47
  • SCAFFOLDING
  • Slide 48
  • Formal definition of the scaffolding problem. Input: a set of contigs and a set of edges, where each edge spans a fixed length. Output: an ordering of the contigs such that the number of discordant edges is minimized. [Figure: a discordant edge]
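To illustrate the objective, the sketch below counts discordant edges for a candidate scaffold. The slide does not spell out when an edge is discordant, so the model here is an assumption: an edge `(a, sign_a, b, sign_b, max_between)` asks for contig `a` in orientation `sign_a` to precede contig `b` in orientation `sign_b` (or the reverse-complemented arrangement), with at most `max_between` bases of other contigs in between. All names are hypothetical.

```python
def is_concordant(scaffold, lengths, edge):
    """scaffold: list of (contig, sign) in order; sign is +1 or -1."""
    a, sign_a, b, sign_b, max_between = edge
    pos = {name: i for i, (name, _) in enumerate(scaffold)}
    sign = dict(scaffold)
    i, j = pos[a], pos[b]
    if i < j:
        oriented = (sign[a], sign[b]) == (sign_a, sign_b)
    else:
        # Same constraint read on the opposite strand: order and signs flip.
        oriented = (sign[a], sign[b]) == (-sign_a, -sign_b)
    lo, hi = sorted((i, j))
    between = sum(lengths[name] for name, _ in scaffold[lo + 1:hi])
    return oriented and between <= max_between

def count_discordant(scaffold, lengths, edges):
    return sum(not is_concordant(scaffold, lengths, e) for e in edges)

lengths = {"A": 5, "B": 3, "C": 4}  # toy contig lengths in kb
edges = [("A", 1, "B", 1, 0), ("B", 1, "C", -1, 0), ("A", 1, "C", -1, 2)]
forward = [("A", 1), ("B", 1), ("C", -1)]
reverse = [("C", 1), ("B", -1), ("A", -1)]
print(count_discordant(forward, lengths, edges))  # 1
print(count_discordant(reverse, lengths, edges))  # 1
```

A scaffold and its reverse complement get the same score under this model, which matches the intuition that a scaffold can be read from either strand.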
  • Slide 49
  • Scaffolding problem is NP-hard. Huson et al. (2002) showed that scaffolding is NP-hard. A number of heuristic solutions exist: Celera Assembler [Myers et al., 2000]; Euler [Pevzner et al., 2001]; Jazz [Chapman et al., 2002]; Arachne [Batzoglou et al., 2002]; Velvet [Zerbino et al., 2008]; Bambus [Pop et al., 2004]. Can we solve the problem optimally? Is the optimal solution better?
  • Slide 50
  • A parameter: width (w). Since every contig has some minimum length and every edge spans a fixed length, we expect every edge to span at most w contigs for some constant width w. [Figure: an edge spanning at most w contigs]
  • Slide 51
  • Two parts. Fixed-parameter polynomial-time algorithm: we showed that the running time of the scaffolder depends on a parameter, the width. Graph contraction: we proposed a way to reduce the graph.
  • Slide 52
  • Scaffolding when there is no discordant edge. When there is no discordant edge, a naive solution is: enumerate all possible signed permutations of the contigs in a tree, and prune a subtree if its scaffold is not feasible. This takes exponential time. [Figure: search tree over contigs A, B, C, D, rooted at +A, with nodes +A+B, +A-B, +A+C, +A-C, +A+B+C, +A+B-C, +A-C+B, +A-C-B]
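The naive tree search can be sketched as a recursive enumeration with pruning. The feasibility test is an assumption made for illustration: an edge `(a, sign_a, b, sign_b, max_between)` is satisfied when the two contigs appear in the requested order and orientation (or the reverse-complemented arrangement) with at most `max_between` contigs between them. A partial scaffold is pruned as soon as an edge with both endpoints placed is violated.

```python
def concordant(scaffold, edge):
    a, sign_a, b, sign_b, max_between = edge
    pos = {name: i for i, (name, _) in enumerate(scaffold)}
    sign = dict(scaffold)
    i, j = pos[a], pos[b]
    if i < j:
        oriented = (sign[a], sign[b]) == (sign_a, sign_b)
    else:
        oriented = (sign[a], sign[b]) == (-sign_a, -sign_b)
    return oriented and abs(i - j) - 1 <= max_between

def feasible(scaffold, edges):
    """Check only edges whose two contigs are already placed."""
    placed = {name for name, _ in scaffold}
    return all(concordant(scaffold, e) for e in edges
               if e[0] in placed and e[2] in placed)

def enumerate_scaffolds(contigs, edges, prefix=()):
    """Yield every signed permutation of `contigs` with no discordant edge."""
    if len(prefix) == len(contigs):
        yield prefix
        return
    placed = {name for name, _ in prefix}
    for name in contigs:
        if name in placed:
            continue
        for sign in (+1, -1):
            candidate = prefix + ((name, sign),)
            if feasible(candidate, edges):  # otherwise prune this subtree
                yield from enumerate_scaffolds(contigs, edges, candidate)

edges = [("A", 1, "B", 1, 0), ("B", 1, "C", -1, 0)]
solutions = set(enumerate_scaffolds(["A", "B", "C"], edges))
```

For the toy input the search returns two scaffolds, +A+B-C and its reverse complement +C-B-A. Without pruning the tree has 2^n * n! leaves, which is why this search takes exponential time.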
  • Slide 53
  • Observation. Lemma: Consider two scaffolds S1 and S2. If both scaffolds share a common tail of width w, then either both S1 and S2 can be extended to a feasible solution or neither can. Proof: based on the Bandwidth Problem [Saxe, 1980], with three additions: orientation of nodes, direction of edges, and discordant edges. [Figure: upper bound (w)] * J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), 363-369 (1980)
  • Slide 54
  • Scaffold Tail is Sufficient. Analogous to the Bandwidth Problem [Saxe, 1980], with three additions: orientation of nodes, direction of edges, and discordant edges. [Figure: upper bound (w)] * J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), 363-369 (1980)
  • Slide 55
  • Equivalence class of scaffolds. If S1 and S2 have the same tail, they are in the same class. Features of an equivalence class: all scaffolds in it use the same set of contigs; either all or none of them can be extended to a solution. [Figure: Tail; +A-B+C +D+E; -A+C +D+E+F]
  • Slide 56
  • Scaffolding with p discordant edges. When there are discordant edges, we try all possible ways to discard the p discordant edges, then run the scaffolding with no discordant edges on what remains. This gives an O(|E| |V|^(w+p))-time algorithm.
  • Slide 57
  • Graph Contraction 20k
  • Slide 58
  • Graph Contraction
  • Slide 59
  • Slide 60
  • Gap Estimation. Utility: genome finishing (genome size estimation); scaffold correctness. Calculating gap sizes: maximum likelihood; quadratic function; solved through quadratic programming [Goldfarb et al., 1983]; polynomial time. [Figure: gaps g1, g2, g3 between consecutive contigs] * Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
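Why maximum likelihood leads to a quadratic function can be seen in the simplest case of a single gap. The sketch below is a simplification and not the joint quadratic program referred to above: it assumes Gaussian insert sizes, treats one gap in isolation, and ignores sampling bias. Each linking read pair is given as a hypothetical triple `(mu, sigma, d)`, where `mu` and `sigma` describe the library insert size and `d` is the part of the insert that lies inside the two contigs, so that insert size = d + gap.

```python
def ml_gap(links):
    """Maximum-likelihood gap size for one pair of adjacent contigs.

    The Gaussian log-likelihood is -sum((d + g - mu)^2 / (2 * sigma^2)),
    a quadratic in g whose maximum is the precision-weighted mean of (mu - d).
    """
    numerator = 0.0
    denominator = 0.0
    for mu, sigma, d in links:
        weight = 1.0 / sigma ** 2
        numerator += weight * (mu - d)
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no linking read pairs")
    return numerator / denominator

# Two read pairs from a 300bp library suggest gaps of 50bp and 30bp.
gap = ml_gap([(300, 30, 250), (300, 30, 270)])
```

With several gaps and read pairs spanning more than one gap, the terms couple the gap variables together, and the objective becomes a multivariate quadratic, which is where a quadratic programming solver comes in.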
  • Slide 61
  • Runtime Comparison
    Scaffolder   E. coli   B. pseudomallei   S. cerevisiae   D. melanogaster
    Bambus       50s       16m               2m              3m
    SOPRA        49m       -                 2h              5h
    Opera        4s        7m                11s             30s
    Simulated datasets (generated using MetaSim): coverage of 2x80bp PETs with insert size 300bp: 40X; coverage of 2x50bp PETs with insert size 10kbp: 2X; contigs assembled using Velvet.
    In-house data (B. pseudomallei): coverage of 100bp 454 reads: 20X; coverage of 2x20kbp PETs with insert size library: 2.8X; contigs assembled using Newbler.
  • Slide 62
  • Scaffold Contiguity
  • Slide 63
  • Scaffold Correctness
  • Slide 64
  • Scaffold Correctness
    Organisms: E. coli, S. cerevisiae, D. melanogaster
    Opera: 1, 3, 4
    Bambus: 1955423
  • Slide 65
  • References. Pramila Nuwantha Ariyaratne, Wing-Kin Sung: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2): 167-174 (2011). Song Gao, Niranjan Nagarajan, Wing-Kin Sung: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. RECOMB 2011: 437-451.
  • Slide 66
  • Acknowledgement. Bioinformatics: Zhizhuo Xueliang Chandana Rikky Gao Song Pramila Charlie Lee Guoliang Li Han Xu Fabi. Infectious Disease: Patrick Tan. Sequencing group: Ruan Yijun Wei Chialin Yao Fei Liu Jun Herve Thoreau. Sequencing team.