Sequencing and Assembly
Cont’d
CS273a Lecture 5, Aut08, Batzoglou
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
Some Terminology
read: a 500-900 bp long word that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
consensus sequence: the sequence derived from the multiple alignment of reads in a contig
2. Merge Reads into Contigs
• Overlap graph:
Nodes: reads r1…rn
Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes
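The overlap graph above can be sketched in a few lines. This is only an illustration under strong assumptions: overlaps are exact suffix-prefix matches, only the forward orientation is considered, and the score is simply the overlap length. The names Overlap, suffix_prefix_overlap and build_overlap_graph are hypothetical and do not come from any assembler.

```python
from typing import NamedTuple


class Overlap(NamedTuple):
    src: int    # index i of read r_i
    dst: int    # index j of read r_j
    shift: int  # offset of r_j's start relative to r_i's start
    score: int  # assumption: score is just the overlap length


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if k > 0 and a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> list[Overlap]:
    """All-pairs O(n^2) comparison; real assemblers index k-mers instead."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = suffix_prefix_overlap(a, b, min_len)
            if k:
                edges.append(Overlap(i, j, len(a) - k, k))
    return edges


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
edges = build_overlap_graph(reads, min_len=4)
for e in edges:
    print(e)
```

With these three toy reads, read 0 overlaps read 1 and read 1 overlaps read 2, each by four letters. Orientation and inexact matching are left out to keep the sketch short.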
2. Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
repeat region
Unique Contig
Overcollapsed Contig
2. Merge Reads into Contigs
• Remove transitively inferable overlaps
If read r overlaps reads r1 and r2 to its right, and r1 overlaps r2, then (r, r2) can be inferred from (r, r1) and (r1, r2)
[Figure: overlapping reads r, r1, r2, r3]
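The transitive-overlap rule above can be sketched as follows. It is a simplification: edges are bare (left read, right read) pairs, the graph is assumed to be acyclic, and the shift consistency that a real assembler would check is ignored. The function name remove_transitive_edges is hypothetical.

```python
def remove_transitive_edges(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Drop (r, r2) whenever some r1 has both (r, r1) and (r1, r2)."""
    successors: dict[str, set[str]] = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    reduced = set()
    for u, w in edges:
        # (u, w) is inferable if another successor v of u also overlaps w.
        inferable = any(
            w in successors.get(v, set())
            for v in successors[u]
            if v != w
        )
        if not inferable:
            reduced.add((u, w))
    return reduced


edges = {("r", "r1"), ("r", "r2"), ("r1", "r2"), ("r2", "r3")}
reduced = remove_transitive_edges(edges)
print(sorted(reduced))
```

In this example the edge (r, r2) is dropped, leaving the chain r, r1, r2, r3.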
2. Merge Reads into Contigs
• Ignore “hanging” reads when detecting repeat boundaries
[Figure labels: sequencing error; repeat boundary???; branches a, b]
Overlap graph after forming contigs
Unitigs: Gene Myers, 95
Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
A read that spans across a repeat disambiguates the order of the flanking regions
• Repeats with more base pair differences than the sequencing error rate are OK
We throw out overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
Increase read length
Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors
decreases error rate → decreases effective repeat content → increases contig length
3. Link Contigs into Supercontigs
Normal density
Too dense: overcollapsed
Inconsistent links: overcollapsed?
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if at least 2 forward-reverse links
supercontig (aka scaffold)
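The grouping part of this step can be sketched as below. It shows only which contigs end up in the same supercontig when a pair is joined on at least two mate-pair links. Ordering, orientation and gap distances, which a real scaffolder must also derive from the forward-reverse links, are left out. The function link_contigs, the min_links parameter and the contig names are hypothetical.

```python
from collections import Counter


def link_contigs(contigs, mate_links, min_links=2):
    """Group contigs joined by at least min_links mate-pair links (union-find)."""
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the group representative, compressing paths.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    support = Counter(tuple(sorted(pair)) for pair in mate_links)
    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


contigs = ["c1", "c2", "c3", "c4"]
mate_links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
              ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
supercontigs = link_contigs(contigs, mate_links)
print(supercontigs)
```

Here c2 and c3 share only one link, so they stay in separate supercontigs.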
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
Complex algorithmic step
• Exponential number of paths
• Forward-reverse links
4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
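Weighted voting can be sketched as below, assuming the multiple alignment is already given as equal-width rows, a space marks a position a read does not cover, and each base carries a numeric quality used as its vote weight. The function name weighted_consensus and the quality values are made up for illustration.

```python
from collections import defaultdict


def weighted_consensus(rows, quals):
    """Pick, per column, the letter with the largest summed quality."""
    length = max(len(r) for r in rows)
    out = []
    for col in range(length):
        votes = defaultdict(float)
        for row, q in zip(rows, quals):
            if col < len(row) and row[col] != " ":
                votes[row[col]] += q[col]
        # "N" marks a column that no read covers.
        out.append(max(votes, key=votes.get) if votes else "N")
    return "".join(out)


rows = ["ACGT ",
        "ACCTA",
        " CGTA"]
quals = [[30, 30, 30, 30, 0],
         [30, 30, 10, 30, 30],
         [0, 30, 30, 30, 30]]
print(weighted_consensus(rows, quals))
```

In the third column the two high-quality G votes outweigh the single low-quality C. The alternative on the slide would instead keep the single letter with the highest quality in each column.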
Some Assemblers
• PHRAP
Early assembler, widely used, good model of read errors
Overlap O(n²) → layout (no mate pairs) → consensus
• Celera
First assembler to handle large genomes (fly, human, mouse)
Overlap → layout → consensus
• Arachne
Public assembler (mouse, several fungi)
Overlap → layout → consensus
• Phusion
Overlap → clustering → PHRAP → assemblage → consensus
• Euler
Indexing → Euler graph → layout by picking paths → consensus
Quality of assemblies—mouse
Terminology: N50 contig length
If we sort contigs from largest to smallest, and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
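The N50 definition above translates directly into a short function. As an assumption, when no genome size is supplied the total assembled length stands in for it. The function name n50 and the contig lengths are illustrative only.

```python
def n50(contig_lengths, genome_size=None):
    """Length of the contig at which the running total first reaches 50%."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0  # the contigs never reach half of the given genome size


lengths = [80, 70, 50, 40, 30, 20, 10]  # total 300
print(n50(lengths))
```

Here the two largest contigs (80 + 70 = 150) already cover half of the 300 total, so N50 is 70.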
7.7X sequence coverage
Quality of assemblies—dog
7.5X sequence coverage
Quality of assemblies—chimp
3.6X sequence coverage
Assisted Assembly
History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: H. influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present: human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”
1997
$399 Personal Genome Service (November 2007)
$2,500 Health Compass service (April 2008)
$985 deCODEme (November 2007)
$350,000 Whole-genome sequencing (November 2007)
Genetic Information Nondiscrimination Act (May 2008)
Applications
Whole-genome sequencing
Comparative genomics
Genome resequencing
Structural variation analysis
Polymorphism discovery
Metagenomics
Environmental sequencing
Gene expression profiling
Genotyping
Population genetics
Migration studies
Ancestry inference
Relationship inference
Genetic screening
Drug targeting
Forensics
Sequencing applications
Demand for more sequencing
Sequencing technology improvement
Increase in sequencing data output
New sequencing applications
Sequencing technology: Sanger sequencing
Fred Sanger
[Figure: timeline 1975, 1980, 1990, 2000, 2008]
Cost per finished bp: [axis from $10.00 down to $0.01]
Read length: from 15 – 200 bp to 500 – 1,000 bp
Throughput: from “grad-student years” to 2 × 10^6 bp/day
Sequencing technology: Sanger sequencing
3 × 10^9 bp = 1x coverage
(10x coverage × 3 × 10^9 bp) / (2 × 10^6 bp/day) ≈ 40 years
10x coverage × 3 × 10^9 bp × $0.001/bp = $30 million
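The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. The constants are the ones on the slide. The variable names are illustrative, a 365-day year is assumed, and the cost is expressed as 1,000 bp per dollar (the same as $0.001/bp) so the calculation stays in integers.

```python
GENOME_BP = 3_000_000_000          # 3 x 10^9 bp
COVERAGE = 10                      # 10x coverage
THROUGHPUT_BP_PER_DAY = 2_000_000  # 2 x 10^6 bp/day
BP_PER_DOLLAR = 1_000              # equivalent to $0.001/bp

total_bp = COVERAGE * GENOME_BP
days = total_bp / THROUGHPUT_BP_PER_DAY
years = days / 365                 # assumption: 365-day year
cost_usd = total_bp // BP_PER_DOLLAR

print(f"{days:,.0f} days = {years:.1f} years")
print(f"${cost_usd:,}")
```

The time comes out at roughly 41 years, which the slide rounds to 40, and the cost is $30 million.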
Pyrosequencing on a chip
Mostafa Ronaghi, Stanford Genome Technologies Center
454 Life Sciences
Sequencing technology: Next-generation sequencing
Read length: 250 bp
Throughput: 300 Mb/day
Cost: ~ 10,000 bp/$
De novo: yes
Genome Sequencer / FLX
“short reads”
Single Molecule Array for Genotyping—Solexa
Polony Sequencing
Sequencing technology: Next-generation sequencing
Read length: ~ 35 bp
Throughput: 300 – 500 Mb/day
Cost: ~ 100,000 bp/$
De novo: yes
Genome Analyzer / SOLiD Analyzer
“microreads”
Molecular Inversion Probes
Illumina Genotype Arrays
Sequencing technology: Next-generation sequencing
Read length: 1 bp
Throughput: 1 – 2 Mb/day
Cost: 5,000 bp/$
De novo: no
Infinium Assay / GeneChip Array
genotypes
“SNP chips”
Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
Sequencing technology
Technology     Read length (bp)   Throughput (Mb/day)   Cost (bp/$)   De novo
Sanger         1,000              2                     1,000         yes
454            250                300                   10,000        yes
Solexa / ABI   35                 500                   100,000       yes
SNP chip       1                  2                     5,000         no
Application Sanger 454 Solexa/ABI SNP chip
Bacterial sequencing $ sometimes
Mammalian sequencing $$$ not likely today
Mammalian resequencing $$$ $ sort of
Metagenomics $ ?
Genotyping $$$ $$$ $$$
?