Sequencing and Assembly Cont'd CS273a Lecture 5, Aut08, Batzoglou


Steps to Assemble a Genome

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..

Some Terminology

read: a word 500-900 letters long that comes out of the sequencer

mate pair: a pair of reads from the two ends of the same insert fragment

contig: a contiguous sequence formed by several overlapping reads with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs

consensus sequence: the sequence derived from the multiple alignment of reads in a contig


2. Merge Reads into Contigs

• Overlap graph:
  Nodes: reads r1, …, rn
  Edges: overlaps (ri, rj, shift, orientation, score)

Note: of course, we don’t know the “color” of these nodes

Reads that come from two regions of the genome (blue and red) that contain the same repeat
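The overlap graph can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `Overlap`, `suffix_prefix`, and `overlap_graph` are hypothetical, overlaps are detected by exact suffix-prefix matching (real assemblers align reads and tolerate sequencing errors), and the score is simply the overlap length.

```python
from typing import NamedTuple

class Overlap(NamedTuple):
    i: int            # index of read r_i
    j: int            # index of read r_j
    shift: int        # start of r_j relative to the start of r_i
    orientation: str  # "+" if r_j is used as given, "-" if reverse-complemented
    score: int        # here: overlap length (an illustrative choice)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def suffix_prefix(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """All-pairs exact overlaps, quadratic in the number of reads."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            for orientation, seq in (("+", b), ("-", revcomp(b))):
                k = suffix_prefix(a, seq, min_len)
                if k:
                    edges.append(Overlap(i, j, len(a) - k, orientation, k))
    return edges

reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
for edge in overlap_graph(reads):
    print(edge)
```

The three toy reads are cut from the example string ACGATTACAATAGGTT, so the graph contains a chain r1 → r2 → r3 with shift 4 on each edge.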


2. Merge Reads into Contigs

We want to merge reads up to potential repeat boundaries

repeat region

Unique Contig

Overcollapsed Contig


2. Merge Reads into Contigs

• Remove transitively inferable overlaps
  If read r overlaps reads r1 and r2 to its right, and r1 overlaps r2, then the overlap (r, r2) can be inferred from (r, r1) and (r1, r2)

r r1 r2 r3
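A small Python sketch of this pruning rule follows. It assumes a simplified representation that is not in the slides: each right overlap is stored as `(r, s) -> shift`, and an edge counts as inferable when the shifts along a two-edge path add up to its own shift, within a hypothetical tolerance `tol`.

```python
def remove_transitive(shifts, tol=0):
    """Drop every edge (r, r2) that is implied by some pair (r, r1), (r1, r2).

    shifts: dict mapping (read, read_to_its_right) -> shift in bases.
    """
    successors = {}
    for a, b in shifts:
        successors.setdefault(a, set()).add(b)

    kept = {}
    for (r, r2), shift in shifts.items():
        inferable = any(
            (r1, r2) in shifts
            and abs(shifts[(r, r1)] + shifts[(r1, r2)] - shift) <= tol
            for r1 in successors[r]
            if r1 != r2
        )
        if not inferable:
            kept[(r, r2)] = shift
    return kept

overlaps = {("r", "r1"): 2, ("r", "r2"): 5, ("r1", "r2"): 3, ("r2", "r3"): 4}
print(remove_transitive(overlaps))
```

In this toy input the edge (r, r2) is dropped because 2 + 3 = 5, leaving the chain r → r1 → r2 → r3.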


2. Merge Reads into Contigs

• Ignore “hanging” reads when detecting repeat boundaries

Figure labels: sequencing error; repeat boundary???; reads a, b


Overlap graph after forming contigs

Unitigs: Gene Myers, '95


Repeats, errors, and contig lengths

• Repeats shorter than read length are easily resolved
  A read that spans across a repeat disambiguates the order of the flanking regions

• Repeats with more base pair differences than the sequencing error rate are OK
  We throw out overlaps between two reads in different copies of the repeat

• To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate

Role of error correction: discards up to 98% of single-letter sequencing errors
  → decreases error rate
  → decreases effective repeat content
  → increases contig length


3. Link Contigs into Supercontigs

Normal density

Too dense → overcollapsed

Inconsistent links → overcollapsed?


3. Link Contigs into Supercontigs

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 forward-reverse links

supercontig (aka scaffold)
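The linking step can be sketched roughly in Python. This is only an illustration under stated assumptions: each mate-pair link is reduced to a pair of contig names, the threshold of 2 links is read as "at least 2", and orientation and distance checks, which a real scaffolder needs, are left out. The function name `link_contigs` and the union-find grouping are illustrative choices, not taken from the slides.

```python
from collections import Counter

def link_contigs(links, min_links=2):
    """Group contigs joined by at least min_links mate-pair links.

    links: iterable of (contig_a, contig_b), one entry per mate pair whose
    two reads fall in different unique contigs.
    """
    counts = Counter(tuple(sorted(pair)) for pair in links if pair[0] != pair[1])

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        find(a)
        find(b)

    # Join the best-supported pairs first, one pair at a time.
    for (a, b), n in counts.most_common():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(group) for group in groups.values())

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c4", "c3")]
print(link_contigs(links))
```

Here c2 and c3 share only one link, so they stay in separate supercontigs while c1-c2 and c3-c4 are joined.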


3. Link Contigs into Supercontigs

Fill gaps in supercontigs with paths of repeat contigs

Complex algorithmic step
• Exponential number of paths
• Forward-reverse links


4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
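Per-column weighted voting can be sketched in a few lines of Python. The rows are the five aligned reads from this slide with gaps written as `-`; the function name `consensus` and the optional per-base `weights` argument (for example, quality scores) are assumptions added for illustration. With no weights, every read votes equally.

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Column-by-column weighted vote over equal-length aligned rows."""
    if weights is None:
        weights = [[1.0] * len(row) for row in rows]
    result = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, row_weights in zip(rows, weights):
            votes[row[col]] += row_weights[col]
        result.append(max(votes, key=votes.get))
    return "".join(result)

alignment = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(alignment))
```

With equal weights each disagreeing column is a 4-to-1 vote, so the result matches the consensus row shown on the slide, including the gap column.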


Some Assemblers

• PHRAP
  • Early assembler, widely used, good model of read errors
  • Overlap O(n^2) → layout (no mate pairs) → consensus

• Celera
  • First assembler to handle large genomes (fly, human, mouse)
  • Overlap → layout → consensus

• Arachne
  • Public assembler (mouse, several fungi)
  • Overlap → layout → consensus

• Phusion
  • Overlap → clustering → PHRAP → assemblage → consensus

• Euler
  • Indexing → Euler graph → layout by picking paths → consensus


Quality of assemblies—mouse

Terminology: N50 contig length. If we sort contigs from largest to smallest and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.

7.7X sequence coverage
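The N50 definition translates into a short Python function. One assumption is made explicit here: when no genome size is supplied, the total contig length stands in for it. The name `n50` and the `genome_size` parameter are illustrative.

```python
def n50(contig_lengths, genome_size=None):
    """Length of the contig at which the running total reaches half of the target size."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # the contigs never cover half of the target size

print(n50([80, 70, 50, 40, 30, 20]))
```

For the toy lengths above the total is 290, and the running sum first reaches half of it (145) at the second contig, so the function returns 70.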


Quality of assemblies—dog

7.5X sequence coverage


Quality of assemblies—chimp

3.6X sequence coverage

Assisted assembly


History of WGA

• 1982: λ-virus, 48,502 bp

• 1995: H. influenzae, 1 Mbp

• 2000: fly, 100 Mbp

• 2001 – present: human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

1997

Gene Myers: “Let’s sequence the human genome with the shotgun strategy”

Phil Green: “That is impossible, and a bad idea anyway”


$399 Personal Genome Service (November 2007)

$2,500 Health Compass service (April 2008)

$985 deCODEme (November 2007)

$350,000 Whole-genome sequencing (November 2007)

Genetic Information Nondiscrimination Act (May 2008)


Applications

Whole-genome sequencing:
• Comparative genomics
• Genome resequencing
• Structural variation analysis
• Polymorphism discovery
• Metagenomics
• Environmental sequencing
• Gene expression profiling

Genotyping:
• Population genetics
• Migration studies
• Ancestry inference
• Relationship inference
• Genetic screening
• Drug targeting
• Forensics


Sequencing applications

Demand for more sequencing

Sequencing technology improvement

Increase in sequencing data output

New sequencing applications


Sequencing technology: Sanger sequencing

Fred Sanger

Timeline: 1975, 1980, 1990, 2000, 2008

Cost per finished bp (axis ticks): $10.00, $1.00, $0.10, $0.01

Read length: from 15 – 200 bp to 500 – 1,000 bp

Throughput: from “grad-student years” to 2 · 10^6 bp/day


Sequencing technology: Sanger sequencing

Genome: 3 · 10^9 bp at 1x coverage

Time: 10x coverage × 3 · 10^9 bp ÷ (2 · 10^6 bp/day) = 15,000 days ≈ 40 years

Cost: 10x coverage × 3 · 10^9 bp × $0.001/bp = $30 million
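The arithmetic on this slide can be checked with a few lines of Python. The numbers are the ones given here (3 · 10^9 bp genome, 10x coverage, 2 · 10^6 bp/day, $0.001/bp); the function name `sequencing_estimate` and the 365-day year are assumptions for illustration.

```python
def sequencing_estimate(genome_bp, coverage, bp_per_day, dollars_per_bp):
    """Return (days, years, dollars) needed to sequence a genome at the given coverage."""
    total_bp = coverage * genome_bp
    days = total_bp / bp_per_day
    years = days / 365  # assumption: 365-day years, one instrument running continuously
    dollars = total_bp * dollars_per_bp
    return days, years, dollars

days, years, dollars = sequencing_estimate(
    genome_bp=3e9, coverage=10, bp_per_day=2e6, dollars_per_bp=0.001
)
print(f"{days:,.0f} days, about {years:.0f} years, about ${dollars / 1e6:.0f} million")
```

The result is 15,000 days, or roughly 41 years, which the slide rounds to 40, at a cost of about $30 million.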


Pyrosequencing on a chip

Mostafa Ronaghi, Stanford Genome Technologies Center

454 Life Sciences


Sequencing technology: Next-generation sequencing

Read length: 250 bp

Throughput: 300 Mb/day

Cost: ~ 10,000 bp/$

De novo: yes

Genome Sequencer / FLX

“short reads”


Single Molecule Array for Genotyping—Solexa


Polony Sequencing


Sequencing technology: Next-generation sequencing

Read length: ~ 35 bp

Throughput: 300 – 500 Mb/day

Cost: ~ 100,000 bp/$

De novo: yes

Genome Analyzer / SOLiD Analyzer

“microreads”


Molecular Inversion Probes


Illumina Genotype Arrays


Sequencing technology: Next-generation sequencing

Read length: 1 bp

Throughput: 1 – 2 Mb/day

Cost: 5,000 bp/$

De novo: no

Infinium Assay / GeneChip Array

genotypes

“SNP chips”


Nanopore Sequencing

http://www.mcb.harvard.edu/branton/index.htm


Sequencing technology

Technology     Read length (bp)   Throughput (Mb/day)   Cost (bp/$)   De novo
Sanger         1,000              2                     1,000         yes
454            250                300                   10,000        yes
Solexa / ABI   35                 500                   100,000       yes
SNP chip       1                  2                     5,000         no

Application (columns: Sanger, 454, Solexa/ABI, SNP chip)

Bacterial sequencing: $, sometimes

Mammalian sequencing: $$$, not likely today

Mammalian resequencing: $$$, $, sort of

Metagenomics: $, ?

Genotyping: $$$, $$$, $$$, ?