Assembly by short paired-end reads Wing-Kin Sung National
University of Singapore
Slide 2
State of genome sequencing Thousands of bacterial genomes and a
few dozen higher organisms have been sequenced. Many genomes are
still waiting to be sequenced. Personalized sequencing is also
a big market. In summary, we need cheaper and faster
sequencing.
Slide 3
Bio-technology: DNA-PETs What data is used for genome assembly?
A DNA-PET is a paired-end tag extracted from the genome. Each tag
has length readlength (e.g. readlength = 35). The span of the
DNA-PET is fixed (e.g. 1kb, 5kb, 10kb, or 20kb).
ACTCAGCACCTTACGGCGTGCATCA TACGTTCTGAACGGCAGTACAAACT
Figure labels: Readlength; Span of the paired-end read
Slide 4
Bio-technology: DNA-PETs Some genome -> Sonication -> Size selection
-> Paired-end sequencing
Slide 5
Sequence Assembly Problem Given the paired-end reads, can we
assemble them to reconstruct the genome?
Slide 6
Agenda: a short discussion on the data quality; a brief review of
existing methods; PE-Assembler; an example demonstrating the use of
assembly; scaffolding.
Size selection is not exact. Sample fragment length distributions:
300bp paired-end library; 10,000bp mate-pair library.
Slide 10
Errors in DNA Sequencing: Ligation errors. These occur in mate-pair
libraries during library construction: two unrelated reads are
paired together. Figure: the 5' and 3' ends of two different
fragments (Chr1, Chr2) are put together.
Slide 11
Errors in DNA Sequencing: Sequencing errors. Caused by the
sequencing machine misreading bases. In most sequencing technologies,
sequencing errors are more likely to occur towards the end of the read.
Actual DNA sequence: ACGTGAGGATGACACGATAGCCA
Sequence as interpreted by the machine: ACGTGAGCATGACACGATAGCCA
The machine incorrectly reads the eighth base (G) as a C.
Slide 12
EXISTING METHODS
Slide 13
SSAKE, VCAKE and SHARCGS Base-by-base 3' extension. Currently,
this approach can assemble short genomes.
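A minimal sketch of base-by-base 3' extension, written only to illustrate the idea and not taken from SSAKE, VCAKE or SHARCGS. The function name `extend_3prime`, the `min_overlap` and `max_length` parameters, the majority-vote rule and the toy genome are all assumptions.

```python
from collections import Counter

def extend_3prime(seed, reads, min_overlap, max_length=10_000):
    """Greedily extend `seed` one base at a time at its 3' end."""
    contig = seed
    while len(contig) < max_length:
        votes = Counter()
        for read in reads:
            # Use the longest prefix of the read (at least min_overlap bases)
            # that matches the current 3' end of the contig.
            for overlap in range(len(read) - 1, min_overlap - 1, -1):
                if contig.endswith(read[:overlap]):
                    votes[read[overlap]] += 1
                    break
        ranked = votes.most_common(2)
        if not ranked:
            break  # no read overlaps the 3' end
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            break  # two bases tie: stop rather than guess
        contig += ranked[0][0]
    return contig

# Toy data (hypothetical): error-free 6bp reads tiling a 14bp sequence.
genome = "ACGTTGCAAGGCTA"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
assembled = extend_3prime(genome[:6], reads, min_overlap=4)
```

On this error-free toy input the extension recovers the whole sequence; with repeats or sequencing errors the vote can tie or mislead, which is one reason such greedy schemes suit short genomes.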
Slide 14
De Bruijn graph approach Velvet, Euler-USR, ABySS, IDBA
E.g. input = {AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT}
List of 3-mers = {AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG,
GGG, GGA, CTT, TTT}
Assembled sequence: AAGACTCCGACTGGGACTTT
Reads aligned to it: AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT
Mark J. Chaisson, Dumitru Brinza and Pavel A. Pevzner. De novo
fragment assembly with short mate-paired reads: Does read length
matter? Genome Res. 19:336-346. 2009
Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo
Short Read Assembly Using De Bruijn Graphs. Genome Res. 18:821-829.
2008
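The 3-mer list for the four example reads can be reproduced with a short sketch. This is an illustration only, not the data structure used by Velvet or the other tools; here each k-mer is a node and an edge links k-mers that are consecutive in some read, and the names `kmers` and `de_bruijn_graph` are assumptions.

```python
from collections import defaultdict

def kmers(read, k):
    """Return the k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return graph

reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]
distinct_3mers = {kmer for read in reads for kmer in kmers(read, 3)}
graph = de_bruijn_graph(reads, 3)
```

The node ACT has three successors (CTC, CTG, CTT), which is where the graph branches and an assembler has to decide how to walk through the repeat.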
Simulation data
Columns, in order: E. coli (PA, Velvet, Allpaths2, ABySS, SOAP);
S. pombe (PA, Velvet, Allpaths2, ABySS, SOAP); HG18 chr10 (PA,
ABySS, SOAP)
Contig statistics
No. of contigs (>200bp): 6, 56, 44, 283, 199; 31, 181, 164, 650, 348; 4262, 49015, 18238
Average length (kb): 777.482606107.622.322.8394.767.975.323.13530.22.96.6
Maximum length (kb): 2492.6, 708.6, 593.7, 163, 232; 3519.6, 856.1, 851, 297.3, 468.7; 403.5, 65.2, 155.8
Contig N50 size (kb): 2492.6, 398.3, 373.3, 63.8, 49.9; 1487.7, 273, 226.8, 80.1, 99.8; 62.4, 5.3, 13
Contig N90 size (kb): 2146, 109.9, 115.4, 33.9, 12.4; 363.6, 54.4, 59.5, 36.7, 19; 11.1, 1.7, 3.6
Coverage: 1, 0.9959, 0.9985, 0.99, 0.9887; 0.9778, 0.9935, 0.986, 0.9838, 0.9891; 0.9089, 0.9204, 0.8714
Evaluation
Large misassemblies: 0110110170145171605
Segment maps: 0.9968, 0.9474, 0.9918, 0.9331, 0.9366; 0.9642, 0.9444, 0.9683, 0.9272, 0.9431; 0.8617, 0.6361, 0.322
Performance
Total execution time (min): 2110227435101407349811748N/A240
Peak memory usage (gb): 2.32.929.72.95.94.57.76668.115.1N/A48
Velvet and Allpaths2 are not efficient enough to handle the HG18
chr10 dataset. The N50 length of an assembly is the length such that
contigs of equal or greater length account for 50% of the total
length. N90 is defined similarly. Segment map: divide the genome
into bins of 1000bp and count the number of bins that are the same
as the reference genome.
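The N50 and N90 definitions above can be sketched as follows. The function name `n_stat`, its integer `percent` parameter and the contig lengths are assumptions chosen for illustration.

```python
def n_stat(contig_lengths, percent=50):
    """Return the largest length L such that contigs of length >= L
    account for at least `percent` percent of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (kb), total 300.
lengths_kb = [80, 70, 50, 40, 30, 20, 10]
n50 = n_stat(lengths_kb, 50)
n90 = n_stat(lengths_kb, 90)
```

Here the two longest contigs (80 + 70 = 150) reach half of the total, so N50 is 70; reaching 90% (270) takes the five longest contigs, so N90 is 30.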
Slide 38
Experimental data We obtained 4 real-life datasets from the
Allpaths2 paper.
Slide 39
Experimental data
Columns, in order, for each genome: PA, Velvet, Allpaths2, ABySS

S. aureus; E. coli
Contig statistics
No. of contigs (>200bp): 24, 60, 14, 187; 21, 121, 25, 277
Average length (kb): 119.8, 48, 205, 18.3; 176.8, 37.5, 184.1, 21.4
Maximum length (kb): 949.9, 475.6, 1122.8, 175.1; 895.9, 356.6, 1015.3, 160.4
Contig N50 size (kb): 685.8, 314.9, 477.2, 63.8; 428.8, 105.6, 337.1, 55.2
Contig N90 size (kb): 107.537.798431.9143.125.481.731.8
Coverage: 0.9945, 0.9899, 0.9924, 0.9828; 0.9956, 0.9919, 0.9963, 0.9896
Evaluation
Large misassemblies: 0, 5, 0, 1; 0, 4, 0, 1
Segment maps: 0.9848, 0.9666, 0.9855, 0.9456; 0.9873, 0.956, 0.9918, 0.9455
Performance
Total execution time (min): 1789513342522229
Peak memory usage (gb): 1.9, 2.8, 20, 2.6; 3.3, 6.9, 37.6, 5.3

S. pombe; N. crassa
Contig statistics
No. of contigs (>200bp): 169, 362, 353, 1028; 2708, 5079, 1687, 9916
Average length (kb): 72.1, 33.7, 33.8, 13; 12.8, 6.8, 18.3, 3.8
Maximum length (kb): 571.1443257.2136.8156.271161.256
Contig N50 size (kb): 147.7, 110.6, 50, 36; 20.7, 11.6, 17.6, 8.1
Contig N90 size (kb): 40, 33.2, 12.2, 12.3; -, -, -, 1
Coverage: 0.9697, 0.9782, 0.952, 0.9793; 0.874, 0.877, 0.7838, 0.887
Evaluation
Large misassemblies: 3262271627318395
Segment maps: 0.9551, 0.9426, 0.926, 0.9108; 0.8206, 0.7744, 0.7466, 0.7128
Performance
Total execution time (min): 36412548307214162665196331
Peak memory usage (gb): 6.615N/A6.62145N/A25.6
Slide 40
Running time Single CPU, multiple core
Slide 41
EXAMPLE APPLICATION
Slide 42
Burkholderia species
Burkholderia pseudomallei (Bp): causative agent of melioidosis, a
serious infectious disease of humans and animals with an overall
fatality rate of 50%.
Burkholderia thailandensis (Bt): non-pathogenic to mammals.
Why can Bp infect humans? Some features are likely required for Bp
to colonize and infect mammals. These include the gain of a
Bp-specific capsular polysaccharide gene cluster.
Figure labels: Wrinkled colonies; Round colonies
Slide 43
Bt E555 My collaborator Patrick Tan thinks that virulence versus
non-virulence is not a black-and-white issue; there should be some
intermediate state. He looked at 28 Bt strains and found Bt E555,
which shows a mixture of smooth and wrinkled colonies.
Slide 44
Sequencing of Burkholderia thailandensis (Bt E555) We sequenced
Bt E555 using the Solexa Genome Analyser II: 12.5M paired-end reads;
each read is 100bp long; insert size is 130-290bp. We map the
sequences onto the Bt reference E264.
Slide 45
De novo assembly of Bt E555 using PE-Assembler 521 contigs; N50:
20293 bp; Total length: 6145909 bp; Longest contig: 72827 bp;
Shortest contig: 250 bp. In particular, contig 19 (24 kbp) is
similar to the Bp-like CPS in Burkholderia pseudomallei. It
replaces EPS.
Slide 46
Phenotype of Bp-like CPS Bp colonies are wrinkled. Bt colonies
are round and smooth. Bt E555 exhibits a mixture of smooth and
wrinkled colonies. The Bt E555 CPS knockout (KO) develops round
colonies with no wrinkling. This suggests that Bp-like CPS
expression may contribute to the wrinkled colonies.
Figure labels: Wrinkled colonies; Mixture of smooth and wrinkled
colonies; Round colonies
Slide 47
SCAFFOLDING
Slide 48
Formal definition of the scaffolding problem Input: a set of
contigs and a set of edges, where each edge spans a known distance.
Output: an ordering of the contigs such that the number of
discordant edges is minimized. Figure label: Discordant edge
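To make the objective concrete, here is a sketch that counts discordant edges for a given scaffold. The slide does not spell out when an edge is discordant, so the model below is an assumption: an edge `(u, su, v, sv, max_span)` asks for contig `u` with sign `su` to precede contig `v` with sign `sv` (or the reverse-complemented arrangement), with at most `max_span` bases of other contigs between them. All names and numbers are hypothetical.

```python
def is_concordant(scaffold, lengths, edge):
    """scaffold: list of (contig, sign) pairs, sign is +1 or -1."""
    u, su, v, sv, max_span = edge
    position = {name: i for i, (name, _) in enumerate(scaffold)}
    sign = dict(scaffold)
    if u not in position or v not in position:
        return False
    i, j = position[u], position[v]
    if i < j:
        oriented = (sign[u], sign[v]) == (su, sv)
    else:
        # Reading the scaffold from the other strand reverses the order
        # and flips both signs.
        oriented = (sign[u], sign[v]) == (-su, -sv)
    low, high = min(i, j), max(i, j)
    between = sum(lengths[name] for name, _ in scaffold[low + 1:high])
    return oriented and between <= max_span

def count_discordant(scaffold, lengths, edges):
    return sum(not is_concordant(scaffold, lengths, e) for e in edges)

lengths = {"A": 1000, "B": 2000, "C": 500}
edges = [("A", 1, "B", 1, 1000), ("A", 1, "C", 1, 1000)]
order_abc = [("A", 1), ("B", 1), ("C", 1)]
order_acb = [("A", 1), ("C", 1), ("B", 1)]
```

Under these assumptions the ordering +A+B+C leaves the A-C edge discordant (2000 bases of B lie between them), while +A+C+B satisfies both edges.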
Slide 49
Scaffolding problem is NP-hard Huson et al. (2002) showed that
scaffolding is NP-hard. A number of heuristic solutions exist:
Celera Assembler [Myers et al., 2000], Euler [Pevzner et al., 2001],
Jazz [Chapman et al., 2002], Arachne [Batzoglou et al., 2002],
Velvet [Zerbino et al., 2008], Bambus [Pop et al., 2004]. Can we
solve the problem optimally? Is the optimal solution better?
Slide 50
A parameter width (w) Since every contig has some minimum
length and every edge spans a fixed length, we expect every edge
to span at most w contigs for some constant width w.
Figure label: At most w contigs
Slide 51
Two parts Fixed parameter polynomial time algorithm We showed
that the running time of the scaffolder depends on a parameter
width Graph Contraction We proposed a way to reduce the graph
Slide 52
Scaffolding when no discordant edge When there is no discordant
edge, a naive solution is: enumerate all possible signed
permutations of the contigs in a tree, and prune a subtree if its
scaffold is not feasible. This takes exponential time.
Search tree (contigs A, B, C, D): +A; +A+B, +A-B, +A+C, +A-C;
+A+B+C, +A+B-C, +A-C+B, +A-C-B
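The naive tree search can be sketched as a recursive generator. The feasibility check is passed in as a function because the slide does not define it; `enumerate_scaffolds` and the two example predicates are assumptions.

```python
def enumerate_scaffolds(contigs, feasible):
    """Yield every complete signed ordering whose prefixes all pass feasible()."""
    def extend(partial, remaining):
        if not remaining:
            yield tuple(partial)
            return
        for name in sorted(remaining):
            for sign in ("+", "-"):
                candidate = partial + [sign + name]
                # Prune the whole subtree as soon as a prefix is infeasible.
                if feasible(candidate):
                    yield from extend(candidate, remaining - {name})
    yield from extend([], frozenset(contigs))

# No pruning: all signed permutations of three contigs.
all_scaffolds = list(enumerate_scaffolds("ABC", lambda s: True))

# Hypothetical pruning rule: only scaffolds that start with +A survive.
rooted = list(enumerate_scaffolds("ABC", lambda s: s[0] == "+A"))
```

Without pruning there are n! * 2^n leaves (48 for three contigs), which is the exponential blow-up noted above; pruning at the root already cuts this to 8 in the second call.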
Slide 53
Observation Lemma: Consider two scaffolds S1 and S2. If both
scaffolds share a common tail of width w, then either both S1 and
S2 can be extended to a feasible solution or neither can. Proof:
Based on the Bandwidth Problem [Saxe, 1980].
Figure labels: Orientation of Nodes; Direction of Edges; Discordant
Edges; Upper Bound (W)
* J. Saxe: Dynamic programming algorithms for recognizing
small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and
Discrete Methods, 1(4), 363-369 (1980)
Slide 54
Scaffold Tail is Sufficient Analogous to the Bandwidth Problem
[Saxe, 1980].
Figure labels: Orientation of Nodes; Direction of Edges; Discordant
Edges; Upper Bound (W)
Slide 55
Equivalence class of scaffolds S1 and S2 have the same tail
-> they are in the same class. Features of an equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution.
Figure: Tail; +A-B+C +D+E; -A+C +D+E+F
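One way to use the equivalence classes is to keep a single representative per class during the search. The sketch below is an assumption about how that bookkeeping could look, not the authors' implementation: the class key is the set of contigs used plus the last w signed contigs, and `state_key` and `dedupe` are hypothetical names. It assumes w >= 1.

```python
def state_key(scaffold, w):
    """Key of a partial scaffold: contigs used so far and its tail of width w."""
    used = frozenset(item[1:] for item in scaffold)  # strip the leading sign
    tail = tuple(scaffold[-w:])
    return used, tail

def dedupe(partials, w):
    """Keep the first partial scaffold seen in each equivalence class."""
    representatives = {}
    for scaffold in partials:
        representatives.setdefault(state_key(scaffold, w), scaffold)
    return list(representatives.values())

a = ["+A", "-B", "+C", "+D"]
b = ["-B", "+A", "+C", "+D"]   # same contigs, same tail (+C, +D)
c = ["+A", "-B", "+D", "+C"]   # same contigs, different tail
kept = dedupe([a, b, c], w=2)
```

Because every member of a class can or cannot be extended together, exploring one representative per key is enough, and the number of keys, rather than the number of signed permutations, bounds the search.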
Slide 56
Scaffolding with p discordant edges When there are discordant
edges, we try all possible ways to discard the p discordant
edges and then run the scaffolding with no discordant edges. This
gives an O(|E| |V|^(w+p))-time algorithm.
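The "discard, then solve" idea can be sketched as a loop over edge subsets. Trying 0, 1, ..., p dropped edges in increasing order is a choice made here so that the first answer drops as few edges as possible. `scaffold_allowing_discordant` is a hypothetical name, and `toy_solver` is only a stand-in for the concordant scaffolder: it treats an edge (u, v) as "u must precede v" and ignores orientation and distance.

```python
from itertools import combinations, permutations

def scaffold_allowing_discordant(edges, p, solve_concordant):
    """Try dropping 0, 1, ..., p edges; return (scaffold, dropped) or None."""
    for k in range(p + 1):
        for dropped in combinations(range(len(edges)), k):
            kept = [e for i, e in enumerate(edges) if i not in dropped]
            scaffold = solve_concordant(kept)
            if scaffold is not None:
                return scaffold, [edges[i] for i in dropped]
    return None

def toy_solver(edges):
    """Hypothetical stand-in: an edge (u, v) only requires u to precede v."""
    names = sorted({name for edge in edges for name in edge})
    for order in permutations(names):
        rank = {name: i for i, name in enumerate(order)}
        if all(rank[u] < rank[v] for u, v in edges):
            return order
    return None

# Three mutually inconsistent edges: at least one must be discarded.
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
strict = scaffold_allowing_discordant(cyclic, 0, toy_solver)
relaxed = scaffold_allowing_discordant(cyclic, 1, toy_solver)
```

With p = 0 no scaffold exists for the cyclic edges; with p = 1 the first edge is dropped and the remaining two are satisfied by the order B, C, A.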
Slide 57
Graph Contraction 20k
Slide 58
Graph Contraction
Slide 59
Slide 60
Gap Estimation Utility: genome finishing (genome size
estimation); scaffold correctness. Calculate gap sizes: maximum
likelihood; quadratic function; solved through quadratic programming
[Goldfarb et al., 1983]; polynomial time. Figure labels: g1, g2, g3
* Goldfarb, D., Idnani, A.: A numerically stable dual method for
solving strictly convex quadratic programs. Mathematical
Programming, 27 (1983)
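For intuition, the single-gap special case has a closed form. This sketch is not the quadratic program from the slide: it assumes one unknown gap between two adjacent contigs and Gaussian insert sizes with known mean, in which case the log-likelihood is quadratic in the gap and is maximized at the mean insert size minus the average observed offset. The name `estimate_gap` and the numbers are hypothetical.

```python
def estimate_gap(mean_insert, observed_offsets):
    """Maximum-likelihood size of one gap under a Gaussian insert-size model.

    observed_offsets: for each paired-end read spanning the gap, the part of
    its insert that lies inside the two contigs (insert size minus the gap).
    """
    if not observed_offsets:
        raise ValueError("need at least one spanning paired-end read")
    return mean_insert - sum(observed_offsets) / len(observed_offsets)

# Hypothetical 10kbp library with three read pairs spanning the gap.
gap = estimate_gap(10000, [7000, 7200, 6800])
```

When several gaps share paired-end reads, the estimates are coupled and the joint likelihood has to be maximized over all gaps at once, which is where a quadratic programming solver comes in.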
Slide 61
Runtime Comparison
Columns, in order: E. coli, B. pseudomallei, S. cerevisiae, D.
melanogaster
Bambus: 50s, 16m, 2m, 3m
SOPRA: 49m, -, 2h, 5h
Opera: 4s, 7m, 11s, 30s
Simulated datasets (generated using MetaSim): coverage of 2x80bp
PETs with insert size 300bp: 40X; coverage of 2x50bp PETs with
insert size 10kbp: 2X; contigs assembled using Velvet.
In-house data (B. pseudomallei): coverage of 100bp 454 reads: 20X;
coverage of 2x20kbp PETs with insert size library: 2.8X; contigs
assembled using Newbler.
Reference
Pramila Nuwantha Ariyaratne, Wing-Kin Sung: PE-Assembler: de novo
assembler using short paired-end reads. Bioinformatics 27(2):
167-174 (2011)
Song Gao, Niranjan Nagarajan, Wing-Kin Sung: Opera: Reconstructing
Optimal Genomic Scaffolds with High-Throughput Paired-End
Sequences. RECOMB 2011: 437-451
Slide 66
Acknowledgement Bioinformatics Zhizhuo Xueliang Chandana Rikky
Gao Song Pramila Charlie Lee Guoliang Li Han Xu Fabi Infectious
Disease Patrick Tan Sequencing group Ruan Yijun Wei Chialin Yao Fei
Liu Jun Herve Thoreau Sequencing team