Sequencing and Assembly GEN875, Genomics and Proteomics, Fall 2010

Sequencing and Assembly

GEN875, Genomics and Proteomics, Fall 2010

History of DNA Sequencing History of DNA Sequencing

Avery: Proposes DNA as ‘Genetic Material’

Watson & Crick: Double Helix Structure of DNA

Holley: Sequences Yeast tRNAAla

1870

1953

1940

1965

1970

1977

1980

1990

2002

Miescher: Discovers DNA

Wu: Sequences Cohesive End DNA

Sanger: Dideoxy Chain TerminationGilbert: Chemical Degradation

Messing: M13 Cloning

Hood et al.: Partial Automation

• Cycle Sequencing • Improved Sequencing Enzymes• Improved Fluorescent Detection Schemes

1986

• Next Generation Sequencing•Improved enzymes and chemistry•Improved image processing

Adapted from Francis Oulette; Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)

Adapted from Francis Oulette; Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)

1

15

150

50,000

25,000

1,500

200,000

50,000,000

Efficiency(bp/person/year)

15,000

100,000,000,000 2008

Explosive Growth in Sequencing

8/22/2005Press Release:

INSD (GenBank, EMBL, DDBJ)

reaches 100 Gigabase

milestone

What do we sequence?

• Genomes (de novo, resequencing)

• Metagenomes or complex samples

• Transcripts

• Fragments recovered by chIP or tagged in some other way

NCBI Genomes http://www.ncbi.nlm.nih.gov/Genomes/

Eukaryotic Genomes:

Complete 23, 25, 22, 20 Assembly 230, 162, 109, 72In progress 229, 235, 299, 166

Prokaryotic Genomes:

Complete 745, 567, 371, 254In progress 1215, 841, 615, 433

Comparison of data from 9/4/08, 9/5/07, 9/4/06 and 8/31/05

NCBI Genomes 9/6/2010

Sequencing Platforms• Sanger sequencing and capillary electrophoresis

• Massively parallel pyrosequencing (454)

• “proprietary Clonal Single Molecule Array technology and novel reversible terminator-based sequencing” (Illumina)

• Sequencing by ligation (ABI SOLiD)

• Single molecule sequencing (PacBio)

Basics of the “old” technology• Clone the DNA.• Generate a ladder of labeled (colored) molecules that

are different by 1 nucleotide.• Separate mixture on some matrix.• Detect fluorochrome by laser.• Interpret peaks as string of DNA.• Strings are 500 to 1,000 letters long• 1 machine generates 57,000 nucleotides/run• Assemble all strings into a genome.

Adapted from Francis Oulette

Sample

Isolate DNA

Physical fragmentationSize

selection

Ligate randomly into vectors

Transformation Plate on agar

Pick and grow individual colonies

Isolate cloned constructs

Cycle Sequencing

High-throughput Steps

Library construction and sequencing

Dual Ended Sequencing Can Provide Information to Link Contigs

5 Kb insert

PrimerA

PrimerB

Sequencing with primers that begin in the vector on either side of the insert yields about 800 bp of DNA sequence from each end of the insert

The middle of the insert is never sequenced for most clones used in the project

Basics of the “new” technology

• Get DNA.• Attach it to something.• Extend and amplify signal with some labeling scheme.• Detect fluorochrome by microscopy.• Interpret series of spots as short strings of DNA.• Strings are 30-300 letters long• Multiple images are interpreted as 0.4 to 1.2 GB/run

(1,200,000,000 letters/day). • Map or align strings to one or many genome or assemble.


Differences between the various platforms:

• Nanotechnology used.

• Resolution of the image analysis.

• Chemistry and enzymology.

• Signal to noise detection in the software

• Software/images/file size/pipeline

• Cost $$$


Next Generation DNA Sequencing Technologies

Genome size: 3000 MbReq'd coverage: 6 12 25

3730 454 FLX Solexabp/read 600 400 32Reads/run 96 400,000 40,000,000 bp/run 57,600 160,000,000 1,280,000,000 #/runs req'd 312,500 225 59

Cost per run 48$ 6,800$ 9,300$ Total cost 15,000,000$ 1,530,000$ 544,922$

Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk

3 Gb ==

Roche 454

Genome sequencing in microfabricated high-density picolitre reactors Margulies, M. Eghold, M. et al. Nature. 2005 Sep 15; 437(7057):326-7

Pyrosequencing

Throughput400-600 million high-quality, filter-passed bases per run* 1 billion bases per day

Run Time 10 hours

Read Length Average length = 400 bases

AccuracyQ20 read length of 400 bases (99% at 400 bases and higher for prior bases)

Reads per run >1 million high-quality reads

GS FLX Titanium Series:

The GS FLX Titanium series reagents run on the Genome Sequencer FLX Instrument, a system based on 454's sequencing-by-synthesis technology. The GS FLX Titanium series improves on the current system with upgraded reagents, consumables, and software.

GS FLX Titanium Series:

http://454.com/products-solutions/how-it-works/sequencing-chemistry.asp

Solexa-based Whole Genome SequencingAdapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk

Solexa-based Whole Genome Sequencing

Solexa flow cell

~50M clusters are sequenced per flow cell.

Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk

Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4

Genome-scale Sequence Analysis

• De novo assembly

• Templated assembly

• Read mapping or alignment to a reference genome

“The choice of alignment or assembly algorithm is strongly influencedby both the experiment in question and the details of thesequencing technology used. The performance characteristics ofthe sequencing machines are changing rapidly, and any delineationof performance characteristics such as machine capacity, runtime or read length and its relationship to error profile will quicklybe outdated.”

Assemblers

• Greedy Assemblers – compare all reads to each other then join them in order of overlap size

Figure 8. Greedy assembly of four reads.

Assemblers

• Overlap Graph Assemblers – make a graph where each node represents a read and edges between them represent overlaps.

Figure 9. Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Assembly with dual-ended sequencing

Sequence assembly

Contigs linked by a spanning clone

Contigs joined by overlaps

Scaffold – two or more linked contigs

Repeat handling

• Screen out known repeats and set them aside for later

• Infer repetitiveness based on coverage

• First assemble unambiguous overlaps, then resolve repeats using mate pairs

Assemblers and short reads

• Full overlap assemblers compare all reads against all other reads. Scale quadratically with the number of reads.

• Computationally intractable for large NSG datasets

• Led to development of k-mer based methods: a de Bruijn graph with a node for every k-mer observed in the sequence set and an edge between nodes if these two k-mers are observed adjacently in a read

Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.

Figure 1. The k-mer uniqueness ratio for five well-known organisms andone single-celled human parasite. The ratio is defined here as the percentageof the genome that is covered by unique sequences of length k orlonger. The horizontal axis shows the length in base pairs of the sequences.For example, ;92.5% of the grapevine genome is contained inunique sequences of 100 bp or longer.

De Bruijn k-mer assemblers

• Newbler (Roche 454)

• SHARCGS• VCAKE• VELVET• EULER-SR• EDENA• ABySS

• ALLPATHS• SOAPdenovo• Contrail

• Most assemblers have an error detection and resolution phase

• Errors produce characteristic graphic structures

Problems with de Bruijn graph methods

• Require large amount of memory to store graph – for example Velvet would require a terabyte of memory to assemble the human genome

• Not as easy to parallelize as overlap assemblers

From Shatz et al. 2010:“To date, only two de Bruijn graph assemblers have been shown to have the

ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.”

How many clones/reads do we need?

…according to the work of Lander and Waterman (Genomic mapping by fingerprinting random clones: a mathematical analysis.Genomics. 1988

Apr;2(3):231-9.), the number of “islands” or contigs formed from randomly collected sequences depends on:

G = Genome LengthL = Sequence Read LengthN = Number of Sequences CollectedT = Number of Basepairs of Overlap Needed

# Islands = NeLNG

TL(- ( 1 - ))

5 Mbp Genome, 500 bp reads, 25 bp overlap

• # readscoverage % sequenced # contigs• 2500 0.25 22.12 1971• 5000 0.5 39.35 3109• 10000 1 63.21 3867• 20000 2 86.47 2991• 30000 3 95.02 1735• 40000 4 98.17 895• 50000 5 99.33 433• 60000 6 99.75 201• 70000 7 99.91 91• 80000 8 99.97 40• 90000 9 99.99 17• 100000 10 100.00 7

Graph of previous data

LW calculations for 5 MB genome, 500 bp reads

0

500

1000

1500

2000

2500

3000

3500

4000

4500

2500

5000

1000

0

2000

0

3000

0

4000

0

5000

0

6000

0

7000

0

8000

0

9000

0

1000

00

number of reads

0102030405060708090100

# contigs % sequenced

Shotgun Sequencing Model

0

500

1000

1500

2000

2500

0 10,000 20,000 30,000 40,000 50,000 60,000 70,000

# Sequences

# n

on

-sin

gle

ton

co

nti

gs

Predicted 5.5 Mb size

Observed # non-singletons

Predicted 3.7Mb size

Genome size as predicted from the assembly

Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.

Combining sequence data types

• In practice, appears to be the best strategy for both microbial and eukaryotic genomes

• Creates assembly challenges of its own

One strategy for microbial genomes

• ~¼ run of 454 regular, ¼ run of paired end (2.5 kb library) plus one lane of Solexa

• Assemble Solexa data with Velvet• Assemble 454 data with Newbler• Shred the Velvet assembly into Newbler size

reads and add it to the 454 assembly• Use Solexa deep coverage to “polish”

Gap Closure Strategies

• Primer walk to sequence the rest of linking clones that span a scaffold gap

• Primer walk off clones at the ends of contigs for which there is no linking information

• PCR based on your best guess at contig order (comparison to other closely related genomes, predicted genes at the end of genomes, anything else you can come up with)

• Combinatorial PCR with primers designed at the end of each contig

Phred Scores

Phred Score P( incorrect base ) Base call accuracy

10 1 in 10 90%

20 1 in 100 99%

30 1 in 1000 99.9%

40 1 in 10000 99.99%

50 1 in 100000 99.999%

Documents

Sequencing and Assembly GEN875, Genomics and Proteomics, Fall 2010