Upload
daniela-bryant
View
217
Download
0
Tags:
Embed Size (px)
Citation preview
Sequencing and Assembly
GEN875, Genomics and Proteomics, Fall 2010
History of DNA Sequencing History of DNA Sequencing
Avery: Proposes DNA as ‘Genetic Material’
Watson & Crick: Double Helix Structure of DNA
Holley: Sequences Yeast tRNAAla
1870
1953
1940
1965
1970
1977
1980
1990
2002
Miescher: Discovers DNA
Wu: Sequences Cohesive End DNA
Sanger: Dideoxy Chain TerminationGilbert: Chemical Degradation
Messing: M13 Cloning
Hood et al.: Partial Automation
• Cycle Sequencing • Improved Sequencing Enzymes• Improved Fluorescent Detection Schemes
1986
• Next Generation Sequencing•Improved enzymes and chemistry•Improved image processing
Adapted from Francis Oulette; Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
Adapted from Francis Oulette; Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
1
15
150
50,000
25,000
1,500
200,000
50,000,000
Efficiency(bp/person/year)
15,000
100,000,000,000 2008
Explosive Growth in Sequencing
8/22/2005Press Release:
INSD (GenBank, EMBL, DDBJ)
reaches 100 Gigabase
milestone
What do we sequence?
• Genomes (de novo, resequencing)
• Metagenomes or complex samples
• Transcripts
• Fragments recovered by chIP or tagged in some other way
NCBI Genomes http://www.ncbi.nlm.nih.gov/Genomes/
Eukaryotic Genomes:
Complete 23, 25, 22, 20 Assembly 230, 162, 109, 72In progress 229, 235, 299, 166
Prokaryotic Genomes:
Complete 745, 567, 371, 254In progress 1215, 841, 615, 433
Comparison of data from 9/4/08, 9/5/07, 9/4/06 and 8/31/05
NCBI Genomes 9/6/2010
Sequencing Platforms• Sanger sequencing and capillary electrophoresis
• Massively parallel pyrosequencing (454)
• “proprietary Clonal Single Molecule Array technology and novel reversible terminator-based sequencing” (Illumina)
• Sequencing by ligation (ABI SOLiD)
• Single molecule sequencing (PacBio)
Basics of the “old” technology• Clone the DNA.• Generate a ladder of labeled (colored) molecules that
are different by 1 nucleotide.• Separate mixture on some matrix.• Detect fluorochrome by laser.• Interpret peaks as string of DNA.• Strings are 500 to 1,000 letters long• 1 machine generates 57,000 nucleotides/run• Assemble all strings into a genome.
Adapted from Francis Oulette
Sample
Isolate DNA
Physical fragmentationSize
selection
Ligate randomly into vectors
Transformation Plate on agar
Pick and grow individual colonies
Isolate cloned constructs
Cycle Sequencing
High-throughput Steps
Library construction and sequencing
Dual Ended Sequencing Can Provide Information to Link Contigs
5 Kb insert
PrimerA
PrimerB
Sequencing with primers that begin in the vector on either side of the insert yields about 800 bp of DNA sequence from each end of the insert
The middle of the insert is never sequenced for most clones used in the project
Basics of the “new” technology
• Get DNA.• Attach it to something.• Extend and amplify signal with some labeling scheme.• Detect fluorochrome by microscopy.• Interpret series of spots as short strings of DNA.• Strings are 30-300 letters long• Multiple images are interpreted as 0.4 to 1.2 GB/run
(1,200,000,000 letters/day). • Map or align strings to one or many genome or assemble.
Adapted from Francis Oulette
Differences between the various platforms:
• Nanotechnology used.
• Resolution of the image analysis.
• Chemistry and enzymology.
• Signal to noise detection in the software
• Software/images/file size/pipeline
• Cost $$$
Adapted from Francis Oulette
Next Generation DNA Sequencing Technologies
Genome size: 3000 MbReq'd coverage: 6 12 25
3730 454 FLX Solexabp/read 600 400 32Reads/run 96 400,000 40,000,000 bp/run 57,600 160,000,000 1,280,000,000 #/runs req'd 312,500 225 59
Cost per run 48$ 6,800$ 9,300$ Total cost 15,000,000$ 1,530,000$ 544,922$
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
3 Gb ==
Roche 454
Genome sequencing in microfabricated high-density picolitre reactors Margulies, M. Eghold, M. et al. Nature. 2005 Sep 15; 437(7057):326-7
Pyrosequencing
Throughput400-600 million high-quality, filter-passed bases per run* 1 billion bases per day
Run Time 10 hours
Read Length Average length = 400 bases
AccuracyQ20 read length of 400 bases (99% at 400 bases and higher for prior bases)
Reads per run >1 million high-quality reads
GS FLX Titanium Series:
The GS FLX Titanium series reagents run on the Genome Sequencer FLX Instrument, a system based on 454's sequencing-by-synthesis technology. The GS FLX Titanium series improves on the current system with upgraded reagents, consumables, and software.
GS FLX Titanium Series:
Solexa-based Whole Genome SequencingAdapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Solexa-based Whole Genome Sequencing
Solexa flow cell
~50M clusters are sequenced per flow cell.
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Genome-scale Sequence Analysis
• De novo assembly
• Templated assembly
• Read mapping or alignment to a reference genome
“The choice of alignment or assembly algorithm is strongly influencedby both the experiment in question and the details of thesequencing technology used. The performance characteristics ofthe sequencing machines are changing rapidly, and any delineationof performance characteristics such as machine capacity, runtime or read length and its relationship to error profile will quicklybe outdated.”
Assemblers
• Greedy Assemblers – compare all reads to each other then join them in order of overlap size
Figure 8. Greedy assembly of four reads.
Assemblers
• Overlap Graph Assemblers – make a graph where each node represents a read and edges between them represent overlaps.
Figure 9. Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)
Assembly with dual-ended sequencing
Sequence assembly
Contigs linked by a spanning clone
Contigs joined by overlaps
Scaffold – two or more linked contigs
Repeat handling
• Screen out known repeats and set them aside for later
• Infer repetitiveness based on coverage
• First assemble unambiguous overlaps, then resolve repeats using mate pairs
Assemblers and short reads
• Full overlap assemblers compare all reads against all other reads. Scale quadratically with the number of reads.
• Computationally intractable for large NSG datasets
• Led to development of k-mer based methods: a de Bruijn graph with a node for every k-mer observed in the sequence set and an edge between nodes if these two k-mers are observed adjacently in a read
Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.
Figure 1. The k-mer uniqueness ratio for five well-known organisms andone single-celled human parasite. The ratio is defined here as the percentageof the genome that is covered by unique sequences of length k orlonger. The horizontal axis shows the length in base pairs of the sequences.For example, ;92.5% of the grapevine genome is contained inunique sequences of 100 bp or longer.
De Bruijn k-mer assemblers
• Newbler (Roche 454)
• SHARCGS• VCAKE• VELVET• EULER-SR• EDENA• ABySS
• ALLPATHS• SOAPdenovo• Contrail
• Most assemblers have an error detection and resolution phase
• Errors produce characteristic graphic structures
Problems with de Bruijn graph methods
• Require large amount of memory to store graph – for example Velvet would require a terabyte of memory to assemble the human genome
• Not as easy to parallelize as overlap assemblers
From Shatz et al. 2010:“To date, only two de Bruijn graph assemblers have been shown to have the
ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.”
How many clones/reads do we need?
…according to the work of Lander and Waterman (Genomic mapping by fingerprinting random clones: a mathematical analysis.Genomics. 1988
Apr;2(3):231-9.), the number of “islands” or contigs formed from randomly collected sequences depends on:
G = Genome LengthL = Sequence Read LengthN = Number of Sequences CollectedT = Number of Basepairs of Overlap Needed
# Islands = NeLNG
TL(- ( 1 - ))
5 Mbp Genome, 500 bp reads, 25 bp overlap
• # readscoverage % sequenced # contigs• 2500 0.25 22.12 1971• 5000 0.5 39.35 3109• 10000 1 63.21 3867• 20000 2 86.47 2991• 30000 3 95.02 1735• 40000 4 98.17 895• 50000 5 99.33 433• 60000 6 99.75 201• 70000 7 99.91 91• 80000 8 99.97 40• 90000 9 99.99 17• 100000 10 100.00 7
Graph of previous data
LW calculations for 5 MB genome, 500 bp reads
0
500
1000
1500
2000
2500
3000
3500
4000
4500
2500
5000
1000
0
2000
0
3000
0
4000
0
5000
0
6000
0
7000
0
8000
0
9000
0
1000
00
number of reads
0102030405060708090100
# contigs % sequenced
Shotgun Sequencing Model
0
500
1000
1500
2000
2500
0 10,000 20,000 30,000 40,000 50,000 60,000 70,000
# Sequences
# n
on
-sin
gle
ton
co
nti
gs
Predicted 5.5 Mb size
Observed # non-singletons
Predicted 3.7Mb size
Genome size as predicted from the assembly
Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.
Combining sequence data types
• In practice, appears to be the best strategy for both microbial and eukaryotic genomes
• Creates assembly challenges of its own
One strategy for microbial genomes
• ~¼ run of 454 regular, ¼ run of paired end (2.5 kb library) plus one lane of Solexa
• Assemble Solexa data with Velvet• Assemble 454 data with Newbler• Shred the Velvet assembly into Newbler size
reads and add it to the 454 assembly• Use Solexa deep coverage to “polish”
Gap Closure Strategies
• Primer walk to sequence the rest of linking clones that span a scaffold gap
• Primer walk off clones at the ends of contigs for which there is no linking information
• PCR based on your best guess at contig order (comparison to other closely related genomes, predicted genes at the end of genomes, anything else you can come up with)
• Combinatorial PCR with primers designed at the end of each contig
Phred Scores
Phred Score P( incorrect base ) Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10000 99.99%
50 1 in 100000 99.999%