Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Preview:

Citation preview

Lecture: Bioinformatics

ENS Sacley, 2018

Lecture Overview

Mathias Weller

mathias.weller@u-pem.frLaurent Bulteau

laurent.bulteau@u-pem.fr

- Introduction

I Molecular BiologyI SequencingI AssemblyI Scaffolding

- Phylogenetics

. . .

- Genome Rearrangements

. . .

- Scaffold Filling

2/27

Molecular Biology Basics

DNA

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

3/27

Molecular Biology Basics

DNA

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

3/27

Molecular Biology Basics

Chromosomes

- haploid = 1 set of chromosomes

- diploid = 2 sets of chromosomes

(usually one from each parent)

- procaryotes one circular chromosome, haploid

- eucaryotes set of chromosomes, diploid

44 90 16 46 46

4/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

..AATCGCTAA..

..AATCCTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion,

deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

..AATCCTAA..

..AATCGCTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion,

substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

..AATCGCTAA..

..AATCACTAA..

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss),

translocation,

inversion, crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion, crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion,

crossover

5/27

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication, deletion (loss), translocation,

inversion, crossover

5/27

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

mRNA, Introns, Exons, Genes

mRNA

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene

Gene = START. . . STOP (including

introns) (103-105bp)

6/27

mRNA, Introns, Exons, Genes

6/27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*

GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*

GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

7 /27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAG

TGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTC

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Next Generation Sequencing ( )

ACCA

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

8/27

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

9/27

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

9/27

Conclusion: Sequencing

Sanger

cost: 20ct/kbp

reads: 102-103kbp

errors: 0.01-1%

2nd Gen

cost: 4ct/kbp

reads: 50-500PE

errors: 0.1-1%

3rd Gen

cost: 0.4ct/kbp

reads: 104-106

errors: 5-10%

10 /27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

Problem 3: Shortest Common Superstring is NP-hard

Heuristic Assembly:

- Overlap-Layout-Consensus

- DeBruijn-Graph

11 / 27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA→

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�

1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA�

4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�

5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

example: BANANA�

0

BANANA�1

ANANA�

2

NANA�

3

ANA

�NA� 4

NA

�NA�5

A

�NA

O(read-length2) time & space

can be improved to linear time & space

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

Exercise

Time

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Suffix Trees

AGGAGTC�GAGTCCA.

A TC

C

G

G6

.

0

GAGTC� TC

3

�1

CA.

5

�3

CA.

4

CA.

6

�5

A.

AGTCTC1

GAGTC�

4

�2

CA.2

�0

CA.

pairwise overlaps in O(#reads · read-length + #reads2)

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA→

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGAGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

AGGAGTCGAGTCTCA

edit distance = # of insertions, deletions, and substitutions

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5 6

7

G

5 4 3 4 3 4 5

6

T

6 5 4 3 3 3 4

5

C

6 5 4 3 2 2 3

4

T

6 5 4 3 2 1 2

3

C

6 5 4 4 3 2 1

2

A

6 5 4 3 3 2 1

1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A

6 5 4 3 3 2

1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

6 7

G

5 4 3 4 3 4

5 6

T

6 5 4 3 3 3

4 5

C

6 5 4 3 2 2

3 4

T

6 5 4 3 2 1

2 3

C

6 5 4 4 3 2

1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 4 3 4 4 4 5 6 7

G 5 4 3 4 3 4 5 6

T 6 5 4 3 3 3 4 5

C 6 5 4 3 2 2 3 4

T 6 5 4 3 2 1 2 3

C 6 5 4 4 3 2 1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2 1

0

G

4 3 2 1 0 1 1

0

T

5 4 3 2 1 0 1

0

C

5 4 3 2 1 1 0

0

T

5 4 3 2 1 0 1

0

C

6 5 4 3 2 1 0

0

A

6 5 4 3 3 2 1

0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)

Exercise

Time

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2

1 0

G

4 3 2 1 0 1

1 0

T

5 4 3 2 1 0

1 0

C

5 4 3 2 1 1

0 0

T

5 4 3 2 1 0

1 0

C

6 5 4 3 2 1

0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

Fuzzy Overlap – Edit Distance

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

transitive reduction

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

transitive reduction

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

ACTAG

AGTAG

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACProblems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Exercise

Time

..GAACTTCGCT..

.GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG..

.CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Running Time

#k-mers = O(#reads · read-length)

1. O(1) per k-mer

2. O(1) per k-mer

3. O(size of graph) = O(#k-mers)

Note: edges can be weighted by #occurances

13 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find Eulerian walk

linear time with greedy

k=5

GAAC

AACT

ACTTCTTC

TTCG

TCGC

CGCTCCTT

CTTG

TTGG

Problems

- choose k well

I k too small small repeats become problemsI k too big miss smaller overlaps

- Eulerian walk not neccessarily unique

- some paths in DeBruijn graph inconsistent with reads

- read-errors problematic

error-correction step before assembling

- same problem with repeats as OLC

13 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

9

A

9

C

6

T

3

A

5

C

2

A

1

C

4

T

1

T

1

A

2

A

4

T

2

A

2

T

3

C

2

T

1

T

1

C

2

T

1

T

1

A

2

T

1

T

2

A

2

C

1

T

1

A

2

T

2

T

2

T

1

A

1

C

1

C1

A

1

C

1

1

1

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Correcting Read Errors in DBG

Idea

“faulty” k-mers occur less often

than correct ones

build k-mer count histogram

De Bruijn

have to treat errors before

building the graph

k=30 & 1% error 1/4 faulty

15 /27

Correcting k-mer Errors

Example

suppose: avg. k-mer count = 10 each k-mer occurs about 10x

GCGTATTACGCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACG 12x

TTACGC 9x

TACGCG 9x

ACGCGT 10x

CGCGTC 11x

GCGTCT 10x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

GCGTATTACTCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACT 1x

TTACTC 2x

TACTCG 2x

ACTCGT 1x

CTCGTC 1x

TCGTCT 1x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

16 /27

Correcting k-mer Errors

17 /27

Correcting k-mer Errors

Problem

now we have an idea where an error is, but how to fix it?

Idea

errors turn frequent k-mers into infrequent ones

correction should turn infrequent k-mers into frequent ones

replace infrequent k-mer by “frequent neighbor”

18 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGG

GTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

2

5

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N

Question: Can M be covered by

≤σp alternating paths

σc alternating cycles

of total weight ≥ k?

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can M be covered by

≤σp alternating paths &

≤σc alternating cycles

of total weight ≥ k?

20/27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

25

10

9

14

3

613

10

14

3

6

Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

Question: Can M be covered by

σp alternating paths &

σc alternating cycles

of total weight ≥ k?

20/27

19

10

71

26

22

32

59

71

78

2 60

85

1

38

1

17

19

7

25

21 /27

140

262

149

119

336

116

397

72

1

75

14

5

100

1

21

16

10

37

45

1

22

1

7

61

78

69

1

15

51

40

65

2

61

21 /27

1

63

94

25

9

28

29

55

2

17

1

67

62

1

5

71

26

24

47

83

71

34

71

68

72

1

51

19

79

54

151

67

6

66

34

68

59

2

23

11

12

1

5

1

2

6

1

49

13

71

28

71

5

6

11

33

67

24

88

82

2

42

25

1

39

14

65

5

2

70

24

89

46

1

1

1

80

1

73

64

36

72

33

73

67

59

1

14

1

1

70

3

1

33

53

2

2

1

135

4

2980

74

80

84

11

31

1

1

62

73

38

31

33

77

108

28

13

1 4

1

70

8

5

1

47

3

31

2

1

18

56

49

45

2

2

21 /27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D

2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M

3. “slide” down all arrow tips & ignore directions

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Lemma

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Theorem

Exact Scaffolding is NP-hard, even restricted to

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Exact Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding

1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution X

Note: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Proof

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

23/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding

1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution X

ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

2-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete (bipartite) graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

24/27

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

×2

×1

×1

×2

25/27

Scaffolding with Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

26/27

Scaffolding with Multiplicities

70

5

47

67

70

135

49

3931

5

49

56

80

80

34

6

11

2489

19

25

83

73

82

71

9

14

63

72

64

28

29

33

66

33

68

94

62

62

11

47

18

67

8

6

74

59

12

34

46

88

42

79

77

13

17

26

24

70

65

45

33

73

67

23

71

108

67

36

13

55

5

38

72

51

71

5

28

73

31

71

59

80

54

84

28

29

71

151

24

5

6

33

68

53

25

×2

26/27

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Proof

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

3 2 2 1 3 3 5 5 71

2 1 2 1

Proof

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇒”: contraposition; let p = ambigous path

2 2 21

11

(G ,M,m) not uniquely linearizable

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

5

2

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3 3 3

5

2

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

5

2

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

Proof

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

3

3

22

result is collection of alternating paths & cycles

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Multiplicitiesone &

#non-matching adj. to contig

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily missassembly

2. isolate each ambiguous path information loss

3. cut as few ends as possible as hard as Vertex Cover

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

27/27

Recommended