View
3
Download
0
Category
Preview:
Citation preview
Lecture: Bioinformatics
ENS Sacley, 2018
Lecture Overview
Mathias Weller
mathias.weller@u-pem.frLaurent Bulteau
laurent.bulteau@u-pem.fr
- Introduction
I Molecular BiologyI SequencingI AssemblyI Scaffolding
- Phylogenetics
. . .
- Genome Rearrangements
. . .
- Scaffold Filling
2/27
Molecular Biology Basics
DNA
- double strand
- nucleotides paired: A–T, C–G
- inside nucleus (eucaryotes)
Polymerase
single strand → double strand
3/27
Molecular Biology Basics
DNA
- double strand
- nucleotides paired: A–T, C–G
- inside nucleus (eucaryotes)
Polymerase
single strand → double strand
3/27
Molecular Biology Basics
Chromosomes
- haploid = 1 set of chromosomes
- diploid = 2 sets of chromosomes
(usually one from each parent)
- procaryotes one circular chromosome, haploid
- eucaryotes set of chromosomes, diploid
44 90 16 46 46
4/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
..AATCGCTAA..
..AATCCTAA..
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion,
deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
..AATCCTAA..
..AATCGCTAA..
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion,
substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
..AATCGCTAA..
..AATCACTAA..
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication,
deletion (loss), translocation,
inversion, crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication, deletion (loss),
translocation,
inversion, crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication, deletion (loss), translocation,
inversion, crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication, deletion (loss), translocation,
inversion,
crossover
5/27
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication, deletion (loss), translocation,
inversion, crossover
5/27
mRNA, Introns, Exons, Genes
mRNA
- single strand
- transported outside nucleus
- translated into actual proteins
- Thymine (T) → Uracil (U)
Introns & Exons
parts of DNA cut out when forming
mRNA (“splicing”)
- removed “intron”
- not removed “exon”
Gene
Gene = START. . . STOP (including
introns) (103-105bp)
6/27
mRNA, Introns, Exons, Genes
mRNA
- single strand
- transported outside nucleus
- translated into actual proteins
- Thymine (T) → Uracil (U)
Introns & Exons
parts of DNA cut out when forming
mRNA (“splicing”)
- removed “intron”
- not removed “exon”
Gene
Gene = START. . . STOP (including
introns) (103-105bp)
6/27
mRNA, Introns, Exons, Genes
mRNA
- single strand
- transported outside nucleus
- translated into actual proteins
- Thymine (T) → Uracil (U)
Introns & Exons
parts of DNA cut out when forming
mRNA (“splicing”)
- removed “intron”
- not removed “exon”
Gene
Gene = START. . . STOP (including
introns) (103-105bp)
6/27
mRNA, Introns, Exons, Genes
6/27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*
GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*
GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Sanger Sequencing [Sanger et al ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
chop DNA into pieces and read those
- repeated bases unreliable
7 /27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGTCTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTC
TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGTCTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTC
TGGTACTCA......ACCTCTCAG
TGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGTCTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTC
TGGAGAGTC
TGAGTACCA
ACTCA......ACCTC
TGGTACTCA......ACCTCTCAG
TGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGTCTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTC
TGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAG
TGGTACTCA......ACCTCTCAG
ACCTCTCAG
ACTCATGGTCTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTC
TGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG
ACCTCTCAG
ACTCATGGT
CTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTC
TGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG
ACCTCTCAG
ACTCATGGT
CTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGT
CTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGT
CTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Next Generation Sequencing ( )
ACCA
AGTCTGGAGAGTC
TGAGTACCA
ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG
ACTCATGGT
CTGAGAGGT......TGAGTACCA
TGGTACTCA......ACCTCTCAG
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with
“tagged” bases light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow chip is “full”
Paired-End reads (distance between reads = “insert size”)
8/27
Third-Generation Sequencing
Single Molecule Real Time Sequencing
1. fix a polymerase enzyme under a microscope
2. attach fluorescent molecule to each nucleotide
3. polymerase clips off fluorescent molecule when attaching a base
4. observe change in fluorescence identify base
Nanopores
1. “pore” of diameter ≈ 1nm
2. only unfolded DNA molecule can pass through
3. detect fluctuation in electric field as bases pass through pore
9/27
Third-Generation Sequencing
Single Molecule Real Time Sequencing
1. fix a polymerase enzyme under a microscope
2. attach fluorescent molecule to each nucleotide
3. polymerase clips off fluorescent molecule when attaching a base
4. observe change in fluorescence identify base
Nanopores
1. “pore” of diameter ≈ 1nm
2. only unfolded DNA molecule can pass through
3. detect fluctuation in electric field as bases pass through pore
9/27
Conclusion: Sequencing
Sanger
cost: 20ct/kbp
reads: 102-103kbp
errors: 0.01-1%
2nd Gen
cost: 4ct/kbp
reads: 50-500PE
errors: 0.1-1%
3rd Gen
cost: 0.4ct/kbp
reads: 104-106
errors: 5-10%
10 /27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA
TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA
TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA
TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 2: parts of the sequence might not be covered by reads
sequence with “high coverage”
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA
TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 2: parts of the sequence might not be covered by reads
sequence with “high coverage”
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 2: parts of the sequence might not be covered by reads
sequence with “high coverage”
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 2: parts of the sequence might not be covered by reads
sequence with “high coverage”
11 / 27
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 3: Shortest Common Superstring is NP-hard
Heuristic Assembly:
- Overlap-Layout-Consensus
- DeBruijn-Graph
11 / 27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Naive Overlap
AGGAGTCGAGTCCA→
O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Naive Overlap
AGGAGTCGAGTCCA
O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Naive Overlap
AGGAGTCGAGTCCA
O(#reads2·read-length) time worst-case
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�
1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA�
4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�
5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA�
0
BANANA�1
ANANA�
2
NANA�
3
ANA
�NA� 4
NA
�NA�5
A
�NA
O(read-length2) time & space
can be improved to linear time & space
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
Exercise
Time
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC�GAGTCCA.
A TC
C
G
G6
.
0
GAGTC� TC
3
�1
CA.
5
�3
CA.
4
CA.
6
�5
A.
AGTCTC1
GAGTC�
4
�2
CA.2
�0
CA.
pairwise overlaps in O(#reads · read-length + #reads2)
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
AGGAGTCGGTCTCA→
edit distance = # of insertions, deletions, and substitutions
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
AGGAGTCGGTCTCA
edit distance = # of insertions, deletions, and substitutions
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
AGGAGTCGAGTCTCA
edit distance = # of insertions, deletions, and substitutions
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
AGGAGTCGAGTCTCA
edit distance = # of insertions, deletions, and substitutions
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
4 3 4 4 4 5 6
7
G
5 4 3 4 3 4 5
6
T
6 5 4 3 3 3 4
5
C
6 5 4 3 2 2 3
4
T
6 5 4 3 2 1 2
3
C
6 5 4 4 3 2 1
2
A
6 5 4 3 3 2 1
1
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
4 3 4 4 4 5
6 7
G
5 4 3 4 3 4
5 6
T
6 5 4 3 3 3
4 5
C
6 5 4 3 2 2
3 4
T
6 5 4 3 2 1
2 3
C
6 5 4 4 3 2
1 2
A
6 5 4 3 3 2
1 1
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
4 3 4 4 4 5
6 7
G
5 4 3 4 3 4
5 6
T
6 5 4 3 3 3
4 5
C
6 5 4 3 2 2
3 4
T
6 5 4 3 2 1
2 3
C
6 5 4 4 3 2
1 2
A 6 5 4 3 3 2 1 1
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG 4 3 4 4 4 5 6 7
G 5 4 3 4 3 4 5 6
T 6 5 4 3 3 3 4 5
C 6 5 4 3 2 2 3 4
T 6 5 4 3 2 1 2 3
C 6 5 4 4 3 2 1 2
A 6 5 4 3 3 2 1 1
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
3 2 1 1 1 2 1
0
G
4 3 2 1 0 1 1
0
T
5 4 3 2 1 0 1
0
C
5 4 3 2 1 1 0
0
T
5 4 3 2 1 0 1
0
C
6 5 4 3 2 1 0
0
A
6 5 4 3 3 2 1
0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)
Exercise
Time
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
3 2 1 1 1 2
1 0
G
4 3 2 1 0 1
1 0
T
5 4 3 2 1 0
1 0
C
5 4 3 2 1 1
0 0
T
5 4 3 2 1 0
1 0
C
6 5 4 3 2 1
0 0
A 6 5 4 3 3 2 1 0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG 3 2 1 1 1 2 1 0
G 4 3 2 1 0 1 1 0
T 5 4 3 2 1 0 1 0
C 5 4 3 2 1 1 0 0
T 5 4 3 2 1 0 1 0
C 6 5 4 3 2 1 0 0
A 6 5 4 3 3 2 1 0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG 3 2 1 1 1 2 1 0
G 4 3 2 1 0 1 1 0
T 5 4 3 2 1 0 1 0
C 5 4 3 2 1 1 0 0
T 5 4 3 2 1 0 1 0
C 6 5 4 3 2 1 0 0
A 6 5 4 3 3 2 1 0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
transitive reduction
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
transitive reduction
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
GCTAGTGGCTAG
TGGCTAGGGTC
CTAGGGTCCGGA
AGGGTCCGGAATTA
ACTAGTAGTAGCCT
ACTAG
AGTAG
TAGTAG
TAGCCT
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACProblems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
Exercise
Time
..GAACTTCGCT..
.GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT
..CCTTGG..
.CCTTCCTTGTTGG.
..CCTTC..
..ACTTG..
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT
..CCTTGG...CCTTCCTTGTTGG.
..CCTTC..
..ACTTG..
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT
..CCTTGG...CCTTCCTTGTTGG.
..CCTTC..
..ACTTG..
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT
..CCTTGG...CCTTCCTTGTTGG.
..CCTTC..
..ACTTG..
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find path using all overlaps
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find Eulerian walk
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find Eulerian walk
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
Running Time
#k-mers = O(#reads · read-length)
1. O(1) per k-mer
2. O(1) per k-mer
3. O(size of graph) = O(#k-mers)
Note: edges can be weighted by #occurances
13 /27
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
real genomes: k = 30-50
2. build “DeBruijn graph”:
for each k-mer add arc from
left to right k-1 mer
3. find Eulerian walk
linear time with greedy
k=5
GAAC
AACT
ACTTCTTC
TTCG
TCGC
CGCTCCTT
CTTG
TTGG
Problems
- choose k well
I k too small small repeats become problemsI k too big miss smaller overlaps
- Eulerian walk not neccessarily unique
- some paths in DeBruijn graph inconsistent with reads
- read-errors problematic
error-correction step before assembling
- same problem with repeats as OLC
13 /27
Correcting Read Errors in Suffix Trees
Example
CAACTTACCAACTCAACAACTTACCTACTTAC
9
A
9
C
6
T
3
A
5
C
2
A
1
C
4
T
1
T
1
A
2
A
4
T
2
A
2
T
3
C
2
T
1
T
1
C
2
T
1
T
1
A
2
T
1
T
2
A
2
C
1
T
1
A
2
T
2
T
2
T
1
A
1
C
1
C1
A
1
C
1
1
1
Idea: low freq. w/ high freq. neighbor ignore branch
14 /27
Correcting Read Errors in Suffix Trees
Example
CAACTTACCAACTCAACAACTTACCTACTTAC
9
A
9
C
6
T
3
A
5
C
2
A
1
C
4
T
1
T
1
A
2
A
4
T
2
A
2
T
3
C
2
T
1
T
1
C
2
T
1
T
1
A
2
T
1
T
2
A
2
C
1
T
1
A
2
T
2
T
2
T
1
A
1
C
1
C1
A
1
C1
1
1
Idea: low freq. w/ high freq. neighbor ignore branch
14 /27
Correcting Read Errors in Suffix Trees
Example
CAACTTACCAACTCAACAACTTACCTACTTAC
9
A
9
C
6
T
3
A
5
C
2
A
1
C
4
T
1
T
1
A
2
A
4
T
2
A
2
T
3
C
2
T
1
T
1
C
2
T
1
T
1
A
2
T
1
T
2
A
2
C
1
T
1
A
2
T
2
T
2
T
1
A
1
C
1
C1
A
1
C1
1
1
Idea: low freq. w/ high freq. neighbor ignore branch
14 /27
Correcting Read Errors in Suffix Trees
Example
CAACTTACCAACTCAACAACTTACCTACTTAC
9
A
9
C
6
T
3
A
5
C
2
A
1
C
4
T
1
T
1
A
2
A
4
T
2
A
2
T
3
C
2
T
1
T
1
C
2
T
1
T
1
A
2
T
1
T
2
A
2
C
1
T
1
A
2
T
2
T
2
T
1
A
1
C
1
C1
A
1
C
1
1
1
Idea: low freq. w/ high freq. neighbor ignore branch
14 /27
Correcting Read Errors in DBG
Idea
“faulty” k-mers occur less often
than correct ones
build k-mer count histogram
De Bruijn
have to treat errors before
building the graph
k=30 & 1% error 1/4 faulty
15 /27
Correcting k-mer Errors
Example
suppose: avg. k-mer count = 10 each k-mer occurs about 10x
GCGTATTACGCGTCTGGCCTCGTATT 8x
GTATTA 9x
TATTAC 7x
ATTACG 12x
TTACGC 9x
TACGCG 9x
ACGCGT 10x
CGCGTC 11x
GCGTCT 10x
CGTCTG 9x
GTCTGG 10x
TCTGGC 10x
CTGGCC 11x
TGGCCT 9x
GCGTATTACTCGTCTGGCCTCGTATT 8x
GTATTA 9x
TATTAC 7x
ATTACT 1x
TTACTC 2x
TACTCG 2x
ACTCGT 1x
CTCGTC 1x
TCGTCT 1x
CGTCTG 9x
GTCTGG 10x
TCTGGC 10x
CTGGCC 11x
TGGCCT 9x
16 /27
Correcting k-mer Errors
17 /27
Correcting k-mer Errors
Problem
now we have an idea where an error is, but how to fix it?
Idea
errors turn frequent k-mers into infrequent ones
correction should turn infrequent k-mers into frequent ones
replace infrequent k-mer by “frequent neighbor”
18 /27
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
end product is a set of “contiguous regions”
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
19 /27
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
end product is a set of “contiguous regions”
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
19 /27
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
end product is a set of “contiguous regions”
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
19 /27
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
end product is a set of “contiguous regions”
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
19 /27
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
end product is a set of “contiguous regions”
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
19 /27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGG
GTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N
Question: Can M be covered by
≤σp alternating paths
σc alternating cycles
of total weight ≥ k?
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N
Question: Can M be covered by
≤σp alternating paths &
≤σc alternating cycles
of total weight ≥ k?
20/27
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
25
10
9
14
3
613
10
14
3
6
Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N
Question: Can M be covered by
σp alternating paths &
σc alternating cycles
of total weight ≥ k?
20/27
19
10
71
26
22
32
59
71
78
2 60
85
1
38
1
17
19
7
25
21 /27
140
262
149
119
336
116
397
72
1
75
14
5
100
1
21
16
10
37
45
1
22
1
7
61
78
69
1
15
51
40
65
2
61
21 /27
1
63
94
25
9
28
29
55
2
17
1
67
62
1
5
71
26
24
47
83
71
34
71
68
72
1
51
19
79
54
151
67
6
66
34
68
59
2
23
11
12
1
5
1
2
6
1
49
13
71
28
71
5
6
11
33
67
24
88
82
2
42
25
1
39
14
65
5
2
70
24
89
46
1
1
1
80
1
73
64
36
72
33
73
67
59
1
14
1
1
70
3
1
33
53
2
2
1
135
4
2980
74
80
84
11
31
1
1
62
73
38
31
33
77
108
28
13
1 4
1
70
8
5
1
47
3
31
2
1
18
56
49
45
2
2
21 /27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Construction
Given a directed graph D1. make a copy of D
2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Construction
Given a directed graph D1. make a copy of D2. duplicate all vertices M
3. “slide” down all arrow tips & ignore directions
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Construction
Given a directed graph D1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a
single alternating path in G
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a
single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh
alternating X covers M X
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a
single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh
alternating X covers M X
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a
single alternating path in G“⇐”: contract each matching edge in the covering alternating path
hits all vertices exactly once X is valid directed path X
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a
single alternating path in G“⇐”: contract each matching edge in the covering alternating path
hits all vertices exactly once X is valid directed path X
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Theorem
Scaffolding is NP-hard, even restricted to
• bipartite graphs
• (σp, σc) ∈ {(0, 1), (1, 0)} and
• ω : E → {0}
Corollary
Scaffolding with 2 weights is NP-hard in any
sufficiently dense graph class.
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Theorem
Scaffolding is NP-hard, even restricted to
• supergraphs of bipartite graphs
• (σp, σc) ∈ {(0, 1), (1, 0)} and
• ω : E → {0, 1}
Corollary
Scaffolding with 2 weights is NP-hard in any
sufficiently dense graph class.
22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Theorem
Scaffolding is NP-hard, even restricted to
• supergraphs of bipartite graphs
• (σp, σc) ∈ {(0, 1), (1, 0)} and
• ω : E → {0, 1}
Corollary
Scaffolding with 2 weights is NP-hard in any
sufficiently dense graph class.22/27
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &
≤ σc alternating cycles of total weight ≥ k?
Theorem
Exact Scaffolding is NP-hard, even restricted to
• supergraphs of bipartite graphs
• (σp, σc) ∈ {(0, 1), (1, 0)} and
• ω : E → {0, 1}
Corollary
Exact Scaffolding with 2 weights is NP-hard in any
sufficiently dense graph class.22/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding
1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Proof
Result S∗ is a valid solution X
Note: taking an edge forbids ≤ 3 OPT edges
mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them
3ω(S∗) ≥ OPT
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Proof
Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges
mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them
3ω(S∗) ≥ OPT
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Proof
Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges
mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them
3ω(S∗) ≥ OPT
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Proof
Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges
mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them
3ω(S∗) ≥ OPT
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Proof
Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges
mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them
3ω(S∗) ≥ OPT
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Theorem
Scaffolding in complete graphs can be
3-approximated in O(|V |2 log |V |) time.
Remark
For Exact Scaffolding, we have to keep an eye on the number of
components too.
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Theorem
Scaffolding in complete (bipartite) graphs can be
3-approximated in O(|V |2 log |V |) time.
Remark
For Exact Scaffolding, we have to keep an eye on the number of
components too.
23/27
3-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Scaffolding1. sort all edges by weight
2. repeatedly take heaviest edge, if
possible
Theorem
Scaffolding in complete (bipartite) graphs can be
3-approximated in O(|V |2 log |V |) time.
Remark
For Exact Scaffolding, we have to keep an eye on the number of
components too.
23/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding
1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remainProof
Result S∗ is a valid solution X
ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remainProof
Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remainTheorem
Exact Scaffolding in complete graphs can be
2-approximated in O(|V |2.5) time.
Remark
For Scaffolding, replace Step 3 by either merging cycles or
removing lightest edge, whatever looses less weight
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remainTheorem
Exact Scaffolding in complete (bipartite) graphs can be
2-approximated in O(|V |2.5) time.
Remark
For Scaffolding, replace Step 3 by either merging cycles or
removing lightest edge, whatever looses less weight
24/27
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remainTheorem
Exact Scaffolding in complete (bipartite) graphs can be
2-approximated in O(|V |2.5) time.
Remark
For Scaffolding, replace Step 3 by either merging cycles or
removing lightest edge, whatever looses less weight
24/27
Scaffolding with Multiplicities
Recall: most eucaryotes are diploid!
GGTGCGAGAGAGGTCATGGATTGCAACGA
GGTGCGAGAGGCCACTCCAATTGCAACGA
×2
×1
×1
×2
25/27
Scaffolding with Multiplicities
Recall: most eucaryotes are diploid!
GGTGCGAGAGAGGTCATGGATTGCAACGA
GGTGCGAGAGGCCACTCCAATTGCAACGA
×2
×1
×1
×2
25/27
Scaffolding with Multiplicities
Recall: most eucaryotes are diploid!
GGTGCGAGAGAGGTCATGGATTGCAACGA
GGTGCGAGAGGCCACTCCAATTGCAACGA
×2
×1
×1
×2
25/27
Scaffolding with Multiplicities
70
5
47
67
70
135
49
3931
5
49
56
80
80
34
6
11
2489
19
25
83
73
82
71
9
14
63
72
64
28
29
33
66
33
68
94
62
62
11
47
18
67
8
6
74
59
12
34
46
88
42
79
77
13
17
26
24
70
65
45
33
73
67
23
71
108
67
36
13
55
5
38
72
51
71
5
28
73
31
71
59
80
54
84
28
29
71
151
24
5
6
33
68
53
25
×2
26/27
Scaffolding with Multiplicities
70
5
47
67
70
135
49
3931
5
49
56
80
80
34
6
11
2489
19
25
83
73
82
71
9
14
63
72
64
28
29
33
66
33
68
94
62
62
11
47
18
67
8
6
74
59
12
34
46
88
42
79
77
13
17
26
24
70
65
45
33
73
67
23
71
108
67
36
13
55
5
38
72
51
71
5
28
73
31
71
59
80
54
84
28
29
71
151
24
5
6
33
68
53
25
×2
26/27
Linearization of Solutions
Problem
no unique chromosome-configuration explaining solution
Proof
27/27
Linearization of Solutions
Problem
no unique chromosome-configuration explaining solution
Proof
27/27
Linearization of Solutions
Problem
no unique chromosome-configuration explaining solution
Proof
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
3 2 2 1 3 3 5 5 71
2 1 2 1
Proof
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇒”: contraposition; let p = ambigous path
2 2 21
11
(G ,M,m) not uniquely linearizable
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇒”: contraposition; let p = ambigous path
2 2 21
11
(G ,M,m) not uniquely linearizable
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇒”: contraposition; let p = ambigous path
2 2 21
11
(G ,M,m) not uniquely linearizable
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇒”: contraposition; let p = ambigous path
2 2 21
11
(G ,M,m) not uniquely linearizable
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
5
2
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
3 3 3
5
2
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
3
5
2
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
3
5
2
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
3
3
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)
Proof
“⇐”: let (G ,M,m) be free of ambigous paths
Reduction (does not decrease number of linearizations):
3
3
22
result is collection of alternating paths & cycles
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily
missassembly
2. isolate each ambiguous path
information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily
missassembly
2. isolate each ambiguous path
information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path
information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path
information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible
as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
Multiplicitiesone &
#non-matching adj. to contig
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
Multiplicitiesone &
#non-matching adj. to contig
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
Multiplicitiesone &
#non-matching adj. to contig
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
Multiplicitiesone &
#non-matching adj. to contig
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible
as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Linearization of SolutionsTheorem
(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”
(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily missassembly
2. isolate each ambiguous path information loss
3. cut as few ends as possible as hard as Vertex Cover
4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)
27/27
Recommended