Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Lecture: Bioinformatics

ENS Sacley, 2018

Lecture Overview

Mathias Weller

mathias.weller@u-pem.frLaurent Bulteau

laurent.bulteau@u-pem.fr

- Introduction

I Molecular BiologyI SequencingI AssemblyI Scaffolding

- Phylogenetics

- Genome Rearrangements

- Scaffold Filling

Molecular Biology Basics

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

- double strand

- nucleotides paired: A–T, C–G

- inside nucleus (eucaryotes)

Polymerase

single strand → double strand

Chromosomes

- haploid = 1 set of chromosomes

- diploid = 2 sets of chromosomes

(usually one from each parent)

- procaryotes one circular chromosome, haploid

- eucaryotes set of chromosomes, diploid

44 90 16 46 46

Mutation

Single-Nucleotide Polymorphism

- DNA damage

- caused by radioactivity, UV light, . . .

- insertion, deletion, substitution

Replication Error

- DNA rearrangement

- caused by errors in Meiosis/Mitosis

- duplication,

deletion (loss), translocation,

inversion, crossover

Mutation

..AATCGCTAA..

..AATCCTAA..

- DNA damage

- insertion,

deletion, substitution

Replication Error

- DNA rearrangement

- duplication,

Mutation

..AATCCTAA..

..AATCGCTAA..

- DNA damage

- insertion, deletion,

substitution

Replication Error

- DNA rearrangement

- duplication,

Mutation

..AATCGCTAA..

..AATCACTAA..

- DNA damage

Replication Error

- DNA rearrangement

- duplication,

Mutation

- DNA damage

Replication Error

- DNA rearrangement

- duplication,

Mutation

- DNA damage

Replication Error

- DNA rearrangement

- duplication,

Mutation

- DNA damage

Replication Error

- DNA rearrangement

- duplication, deletion (loss),

translocation,

Mutation

- DNA damage

Replication Error

- DNA rearrangement

- duplication, deletion (loss), translocation,

Mutation

- DNA damage

Replication Error

- DNA rearrangement

inversion,

crossover

Mutation

- DNA damage

Replication Error

- DNA rearrangement

mRNA, Introns, Exons, Genes

- single strand

- transported outside nucleus

- translated into actual proteins

- Thymine (T) → Uracil (U)

Introns & Exons

parts of DNA cut out when forming

mRNA (“splicing”)

- removed “intron”

- not removed “exon”

Gene = START. . . STOP (including

introns) (103-105bp)

- single strand

Introns & Exons

- single strand

Introns & Exons

Sanger Sequencing [Sanger et al ’77]

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

1. split helix of thousands of copies (“amplified genome”)

2. add polymerase & floating bases: A C G T3. add a special base: A* (polymerase cannot extend, colored)

4. stir & let polymerase act

5. measure the length of each fragment

each length is the position of a T in the template

Problem

- frequency of longer reads decreases drastically

- length-estimate unreliable after a couple hundred bp

chop DNA into pieces and read those

- repeated bases unreliable

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAATGGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

Sanger Sequencing

Problem

CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTAGGA*GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

Problem

Sanger Sequencing

Problem

Sanger Sequencing

Problem

GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA

GGACCTGCCCA*GGACCTGCCCAGTCTGTA*

Sanger Sequencing

Problem

GGA*GGACCTGCCCA*

GGACCTGCCCAGTCTGTA*

Sanger Sequencing

Problem

Sanger Sequencing

Problem

Sanger Sequencing

Problem

Sanger Sequencing

Problem

Next Generation Sequencing ( )

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

ACTCATGGTCTGAGAGGT......TGAGTACCA

TGGTACTCA......ACCTCTCAG

1. chop DNA into smaller pieces (approximate size known)

2. add anchors to each end of each piece

3. “flow cell” containing anchor places

4. strand anchors its two ends to two anchor places

5. polymerase completes the strand into double-strand with

“tagged” bases light emission

6. double strand is cut into single strands

7. rinse, repeat (last 3 steps) until flow chip is “full”

Paired-End reads (distance between reads = “insert size”)

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAGACCTCTCAG

AGTCTGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGACCTCTCAG

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTC

TGGTACTCA......ACCTCTCAGACCTCTCAG

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAG

ACCTCTCAG

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

CTGAGAGGT......TGAGTACCA

TGGAGAGTC

TGAGTACCA

ACTCA......ACCTCTGGTACTCA......ACCTCTCAGTGGTACTCA......ACCTCTCAG

ACCTCTCAG

ACTCATGGT

AGTCTGGAGAGTC

TGAGTACCA

ACTCATGGT

AGTCTGGAGAGTC

TGAGTACCA

ACTCATGGT

AGTCTGGAGAGTC

TGAGTACCA

ACTCATGGT

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

Third-Generation Sequencing

Single Molecule Real Time Sequencing

1. fix a polymerase enzyme under a microscope

2. attach fluorescent molecule to each nucleotide

3. polymerase clips off fluorescent molecule when attaching a base

4. observe change in fluorescence identify base

Nanopores

1. “pore” of diameter ≈ 1nm

2. only unfolded DNA molecule can pass through

3. detect fluctuation in electric field as bases pass through pore

Conclusion: Sequencing

Sanger

cost: 20ct/kbp

reads: 102-103kbp

errors: 0.01-1%

2nd Gen

cost: 4ct/kbp

reads: 50-500PE

errors: 0.1-1%

3rd Gen

cost: 0.4ct/kbp

reads: 104-106

errors: 5-10%

10 /27

Sequence Assembly: Overview

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTCACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Goal: reconstruct sequence

Problem 1: only have (small) reads

Idea: overlap reads to form complete sequence

11 / 27

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA

TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

11 / 27

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Problem 2: parts of the sequence might not be covered by reads

sequence with “high coverage”

11 / 27

ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACATCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC

TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

11 / 27

Problem 3: Shortest Common Superstring is NP-hard

Heuristic Assembly:

- Overlap-Layout-Consensus

- DeBruijn-Graph

11 / 27

Overlap-Layout-Consensus Assembly

Overlap-Layout-Consensus assemblers

1. produce pairwise overlaps

2. layout the reads according to the overlaps

3. for each position, compute consensus base

Problems

- overlap step too slow in practice:

108 reads 1016 read-pairs

heuristics exclude most of the read-pairs before overlap

- fragmented genome due to repeats

12 /27

Naive Overlap

AGGAGTCGAGTCCA→

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

Problems

12 /27

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case2. layout the reads according to the overlaps

Problems

12 /27

Naive Overlap

AGGAGTCGAGTCCA

O(#reads2·read-length) time worst-case

Problems

12 /27

Suffix Trees

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

O(read-length2) time & space

can be improved to linear time & space2. layout the reads according to the overlaps

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�

ANANA�

NANA�

�NA� 4

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA�

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

Problems

12 /27

Suffix Trees

example: BANANA�

BANANA�1

ANANA�

NANA�

�NA� 4

�NA�5

can be improved to linear time & space

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

Exercise

GAGTC� TC

AGTCTC1

GAGTC�

pairwise overlaps in O(#reads · read-length + #reads2)2. layout the reads according to the overlaps

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

Problems

12 /27

Suffix Trees

AGGAGTC�GAGTCCA.

GAGTC� TC

AGTCTC1

GAGTC�

pairwise overlaps in O(#reads · read-length + #reads2)

Problems

12 /27

Fuzzy Overlap – Edit Distance

AGGAGTCGGTCTCA→

edit distance = # of insertions, deletions, and substitutions

Problems

12 /27

AGGAGTCGGTCTCA

Problems

12 /27

AGGAGTCGAGTCTCA

Problems

12 /27

AGGAGTCGAGTCTCA

Problems

12 /27

dynamic programming where

[i,j] = edit distance of Xi. . . & Yj. . .

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5 6

5 4 3 4 3 4 5

6 5 4 3 3 3 4

6 5 4 3 2 2 3

6 5 4 3 2 1 2

6 5 4 4 3 2 1

6 5 4 3 3 2 1

7 6 5 4 3 2 1 0

modification: any suffix of GGTCTCA & prefix of AGGAGTC for free

best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

5 4 3 4 3 4

6 5 4 3 3 3

6 5 4 3 2 2

6 5 4 3 2 1

6 5 4 4 3 2

6 5 4 3 3 2

7 6 5 4 3 2 1 0

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

4 3 4 4 4 5

5 4 3 4 3 4

6 5 4 3 3 3

6 5 4 3 2 2

6 5 4 3 2 1

6 5 4 4 3 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 4 3 4 4 4 5 6 7

G 5 4 3 4 3 4 5 6

T 6 5 4 3 3 3 4 5

C 6 5 4 3 2 2 3 4

T 6 5 4 3 2 1 2 3

C 6 5 4 4 3 2 1 2

A 6 5 4 3 3 2 1 1

7 6 5 4 3 2 1 0

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2 1

4 3 2 1 0 1 1

5 4 3 2 1 0 1

5 4 3 2 1 1 0

5 4 3 2 1 0 1

6 5 4 3 2 1 0

6 5 4 3 3 2 1

7 6 5 4 3 2 1 0

best overlaps with k errors in O(#reads2 · read-length2)

Exercise

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG

3 2 1 1 1 2

4 3 2 1 0 1

5 4 3 2 1 0

5 4 3 2 1 1

5 4 3 2 1 0

6 5 4 3 2 1

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

Problems

12 /27

= min{ ↓ +1, → +1, → + idXi,Yj}

A G G A G T CG 3 2 1 1 1 2 1 0

G 4 3 2 1 0 1 1 0

T 5 4 3 2 1 0 1 0

C 5 4 3 2 1 1 0 0

T 5 4 3 2 1 0 1 0

C 6 5 4 3 2 1 0 0

A 6 5 4 3 3 2 1 0

7 6 5 4 3 2 1 0

best overlaps with k errors in O(#reads2 · read-length2)

Problems

12 /27

Problems

12 /27

Overlap Graph

reads = vertices

directed edges = overlaps

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

overlap graph non-linear due to repeats

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT3. for each position, compute consensus base

Problems

12 /27

Overlap Graph

reads = vertices

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

transitive reduction

Problems

12 /27

Overlap Graph

reads = vertices

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

transitive reduction

Problems

12 /27

Overlap Graph

reads = vertices

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

Problems

12 /27

Overlap Graph

reads = vertices

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

Problems

12 /27

Overlap Graph

reads = vertices

GCTAGTGGCTAG

TGGCTAGGGTC

CTAGGGTCCGGA

AGGGTCCGGAATTA

ACTAGTAGTAGCCT

TAGTAG

TAGCCT

only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT

Problems

12 /27

Problems

12 /27

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACProblems

12 /27

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTACTTTGCCCCTGAACTC CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC

TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC

Problems

12 /27

Problems

12 /27

DeBruijn-Graph-Based Assembly

1. chop all reads into “k-mers”

real genomes: k = 30-50

2. build “DeBruijn graph”:

for each k-mer add arc from

left to right k-1 mer

3. find path using all overlaps

linear time with greedy

ACTTCTTC

CGCTCCTT

13 /27

ACTTCTTC

CGCTCCTT

Exercise

..GAACTTCGCT..

.GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG..

.CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

ACTTCTTC

CGCTCCTT

..GAACTTCGCT...GAACGAACTAACTTACTTCCTTCGTTCGCTCGCT

..CCTTGG...CCTTCCTTGTTGG.

..CCTTC..

..ACTTG..

13 /27

ACTTCTTC

CGCTCCTT

..CCTTC..

..ACTTG..

13 /27

ACTTCTTC

CGCTCCTT

..CCTTC..

..ACTTG..

13 /27

ACTTCTTC

CGCTCCTT

13 /27

3. find Eulerian walk

ACTTCTTC

CGCTCCTT

13 /27

ACTTCTTC

CGCTCCTT

Running Time

#k-mers = O(#reads · read-length)

1. O(1) per k-mer

2. O(1) per k-mer

3. O(size of graph) = O(#k-mers)

Note: edges can be weighted by #occurances

13 /27

ACTTCTTC

CGCTCCTT

Problems

- choose k well

I k too small small repeats become problemsI k too big miss smaller overlaps

- Eulerian walk not neccessarily unique

- some paths in DeBruijn graph inconsistent with reads

- read-errors problematic

error-correction step before assembling

- same problem with repeats as OLC

13 /27

Correcting Read Errors in Suffix Trees

Example

CAACTTACCAACTCAACAACTTACCTACTTAC

Idea: low freq. w/ high freq. neighbor ignore branch

14 /27

Example

14 /27

Example

14 /27

Example

14 /27

Correcting Read Errors in DBG

“faulty” k-mers occur less often

than correct ones

build k-mer count histogram

De Bruijn

have to treat errors before

building the graph

k=30 & 1% error 1/4 faulty

15 /27

Correcting k-mer Errors

Example

suppose: avg. k-mer count = 10 each k-mer occurs about 10x

GCGTATTACGCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACG 12x

TTACGC 9x

TACGCG 9x

ACGCGT 10x

CGCGTC 11x

GCGTCT 10x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

GCGTATTACTCGTCTGGCCTCGTATT 8x

GTATTA 9x

TATTAC 7x

ATTACT 1x

TTACTC 2x

TACTCG 2x

ACTCGT 1x

CTCGTC 1x

TCGTCT 1x

CGTCTG 9x

GTCTGG 10x

TCTGGC 10x

CTGGCC 11x

TGGCCT 9x

16 /27

17 /27

Problem

now we have an idea where an error is, but how to fix it?

errors turn frequent k-mers into infrequent ones

correction should turn infrequent k-mers into frequent ones

replace infrequent k-mer by “frequent neighbor”

18 /27

Intro: Genome Scaffolding

Recall: repeats (common in DNA) make assembly ambiguous

end product is a set of “contiguous regions”

Problem: “contig soup” not very useful

But: with NGS, we have paired-end information!

Scaffolding + Filling

Scaffolding

Goal: order & orient contigs

Idea: use pairing information on reads to “link” contigs together

19 /27

Scaffolding

19 /27

Scaffolding

19 /27

Scaffolding

19 /27

Scaffolding

19 /27

Graph-Based Scaffolding

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CTAGGCCATTGATTGCGGGTCCAGGTGCTG

GTTAATGTCCGAGCATAAAACTCTGGTTGGC

GTACTGAACTTGGGTTCCATAGGACCCAGA

AGAGCTTGACAGTAACACATTTAGGAGCACGCG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

1. map reads into contigs

2. pair contigs according to read-pairing (weighted)

3. cover “scaffold graph” with (heavy) alternating paths

each path corresponds to a chromosome

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGG

GTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Strategy

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp ∈ N

Question: Can M be covered by

≤σp alternating paths

σc alternating cycles

of total weight ≥ k?

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

≤σp alternating paths &

≤σc alternating cycles

AACGACAC

TCCTTGGG

TTTTTACG

TCGCGG

CGTCGCGG

ACTTGGGGTTTTTAC

CTACTGA

Exact ScaffoldingInput: Graph G , perfect matching M, weights ω, k, σp, σc ∈ N

σp alternating paths &

σc alternating cycles

21 /27

Hardness Warm up: Hamiltonian Path

Recall: Scaffolding

Input: Graph G , perfect matching M, weights ω, k, σp, σc ∈ NQuestion: Can M be covered by ≤ σp alternating paths &

≤ σc alternating cycles of total weight ≥ k?

Construction

Given a directed graph D1. make a copy of D

2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

Recall: Scaffolding

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M

3. “slide” down all arrow tips & ignore directions

Recall: Scaffolding

Construction

Given a directed graph D1. make a copy of D2. duplicate all vertices M3. “slide” down all arrow tips & ignore directions

Recall: Scaffolding

D admits a directed Hamiltonian path ⇔ M can be covered with a

single alternating path in G

Recall: Scaffolding

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

Recall: Scaffolding

single alternating path in G“⇒”: replace each v in the Hamiltonian path by vlow → vhigh

alternating X covers M X

Recall: Scaffolding

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

Recall: Scaffolding

single alternating path in G“⇐”: contract each matching edge in the covering alternating path

hits all vertices exactly once X is valid directed path X

Recall: Scaffolding

Theorem

Scaffolding is NP-hard, even restricted to

• bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0}

Corollary

Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.

Recall: Scaffolding

Theorem

• supergraphs of bipartite graphs

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

sufficiently dense graph class.

Recall: Scaffolding

Theorem

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

sufficiently dense graph class.22/27

Recall: Scaffolding

Theorem

Exact Scaffolding is NP-hard, even restricted to

• (σp, σc) ∈ {(0, 1), (1, 0)} and

• ω : E → {0, 1}

Corollary

Exact Scaffolding with 2 weights is NP-hard in any

sufficiently dense graph class.22/27

3-Approximation in Dense Graphs

σp = 1, σc = 1?

Approximate Scaffolding

1. sort all edges by weight

2. repeatedly take heaviest edge, if

possible

σp = 1, σc = 1?

Approximate Scaffolding1. sort all edges by weight

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

σp = 1, σc = 1?

possible

Result S∗ is a valid solution X

Note: taking an edge forbids ≤ 3 OPT edges

mark the ≤ 3 OPT-edges when taking an edge e e is heaviest among them

3ω(S∗) ≥ OPT

σp = 1, σc = 1?

possible

Result S∗ is a valid solution XNote: taking an edge forbids ≤ 3 OPT edges

3ω(S∗) ≥ OPT

σp = 1, σc = 1?

possible

3ω(S∗) ≥ OPT

σp = 1, σc = 1?

possible

3ω(S∗) ≥ OPT

σp = 1, σc = 1?

possible

3ω(S∗) ≥ OPT

σp = 1, σc = 1?

possible

Theorem

Scaffolding in complete graphs can be

3-approximated in O(|V |2 log |V |) time.

Remark

For Exact Scaffolding, we have to keep an eye on the number of

components too.

σp = 1, σc = 1?

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

Remark

components too.

σp = 1, σc = 1?

possible

Theorem

Scaffolding in complete (bipartite) graphs can be

Remark

components too.

σp = 1, σc = 1?

Approximate Exact Scaffolding

1. compute max.-weight perfect

matching S S ∪M is collection of cycles

2. “fix” all but lightest edge per cycle

3. repeatedly flip any lightest non-fix

4-cycle intersecting 2 cycles

until at most σc + σp cycles remain

4. repeatedly remove lightest non-fix

cycle-edge

until at most σc cycles remain

σp = 1, σc = 1?

Approximate Exact Scaffolding1. compute max.-weight perfect

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

σp = 1, σc = 1?

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution X

ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

σp = 1, σc = 1?

cycle-edge

until at most σc cycles remainProof

Result S∗ is a valid solution Xω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2

σp = 1, σc = 1?

cycle-edge

until at most σc cycles remainTheorem

Exact Scaffolding in complete graphs can be

2-approximated in O(|V |2.5) time.

Remark

For Scaffolding, replace Step 3 by either merging cycles or

removing lightest edge, whatever looses less weight

σp = 1, σc = 1?

cycle-edge

Exact Scaffolding in complete (bipartite) graphs can be

Remark

σp = 1, σc = 1?

cycle-edge

Exact Scaffolding in complete (bipartite) graphs can be

Remark

Scaffolding with Multiplicities

Recall: most eucaryotes are diploid!

GGTGCGAGAGAGGTCATGGATTGCAACGA

GGTGCGAGAGGCCACTCCAATTGCAACGA

Linearization of Solutions

Problem

no unique chromosome-configuration explaining solution

Problem

Linearization of SolutionsTheorem

(G ,M,m) uniquely linearizable ⇔ no “ambigous paths”

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ)

3 2 2 1 3 3 5 5 71

2 1 2 1

“⇒”: contraposition; let p = ambigous path

2 2 21

(G ,M,m) not uniquely linearizable

2 2 21

“⇐”: let (G ,M,m) be free of ambigous paths

Reduction (does not decrease number of linearizations):

result is collection of alternating paths & cycles

(=alt. path of uniform multiplicity µ & each end incident to non-contig < µ) must destroy ambiguous paths

Idea: remove non-matching edges at their endpoints; strategy?

Proposals

1. decide arbitrarily

missassembly

2. isolate each ambiguous path

information loss

3. cut as few ends as possible

as hard as Vertex Cover

4. cut as few multiplicities as possible

as hard as Trans. Del. (∆-free)

Proposals

1. decide arbitrarily

missassembly

information loss

Proposals

1. decide arbitrarily missassembly

information loss

Proposals

information loss

Proposals

2. isolate each ambiguous path information loss

Proposals

3. cut as few ends as possible as hard as Vertex Cover

Proposals

Multiplicitiesone &

#non-matching adj. to contig

Proposals

Multiplicitiesone &

Proposals

Multiplicitiesone &

Proposals

Multiplicitiesone &

Proposals

4. cut as few multiplicities as possible as hard as Trans. Del. (∆-free)

Proposals

Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture...

Documents

Journée Marie-Claude DAVID - Université Paris-Saclay · 2019. 12. 4. · Journée Marie-Claude DAVID - Orsay - 4 février 2018 Magdalena.kobylanski@u-pem.fr ea ls nie,-tssaon duon

Presentation2 - 3.imimg.com · I PRODUCT INTRODUCTIONI MAO Standard Series Twin Shaft Compulsory Concrete Mixer MAO 150082000/3000/4000/4500 Special Features Air purged multi seals

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 1, NO. 1 ...perso.u-pem.fr/~langar/papers/TCC13.pdf · and location constraints. Obviously, the embedding scheme must ensure that the capacity

UNE NOUVELLE MÉTAPHORE MUSICALE Correction 2005 - University of Paris-Est Marne-la ...perso.u-pem.fr/lalibert/Exemples_fichiers/Pdf/Supercordes... · 2018. 12. 17. · 2 Xenakis,

Michel Bulteau COMPLEXITY AND GRACE - Philip Taaffephiliptaaffe.info/wp-content/uploads/BULTEAU-2001.pdf · MICHEL BULTEAU | Complexity and Grace (2001) page 2 of 7 In Terpsichore

Algorithms and Bioinformaticsigm.univ-mlv.fr/~bulteau/teaching/ens/slides1.pdf · Introduction Permutations Context and motivations Comparing genomes Genome rearrangements I Our problem:

Continuous Cauchy wavelet transform analyses of EXAFS spectra: …perso.u-pem.fr/farges/wav/munoz03.pdf · 2003-11-10 · ous wavelet analysis is the determination of “hidden”

Pancake Flipping Is Hard - arXiv.org e-Print archive · 2011. 11. 11. · Pancake Flipping Is Hard Laurent Bulteau, Guillaume Fertin, Irena Rusu Laboratoire d’Informatique de Nantes-Atlantique

Relative entropy method for the diffusive limit of ...mhvignal/exposes/Berthon_Toulouse_2… · Christophe Berthon1, with Marianne Bessemoulin-Chatard, Sol ene Bulteau and H el ene

Increase your productivity with user-friendly automation tools … · Introduction IntroductionI Automation tools presentation - Lizebeth Koloko-Green. Feel free to share on social

AD 262 - Defense Technical Information Center for measurements in the visible and ultraviolet regions of the spectrum. INTRODUCTIONi slope, so that reflectan'ce measurements have

AUGIZEAU BOUQUIN BROSSARD BULTEAU COFFY · PDF fileMINARD Maelle MOUTON Coralie PLISSON Mathilde POUCIN Valéran RAGUSA Charles ... TARDIF Justine VILLAIN Maxime lundi 3 septembre

Cluster-Based Resource Management in OFDMA Femtocell ...perso.u-pem.fr/~langar/papers/TVT13.pdf · the cellular network, most notably, resource allocation and interference management

Michel Bulteau Crysta Is to : A d efn - Dashboard · The cathedral-eyes of the complaint-fairies. Without psalms' coffin's brake. THE SNOW OF THE BELLY BUTTON PALACES. At the eternal

IEEE TRANSACTIONS ON MOBILE COMPUTING An Operations ...perso.u-pem.fr/~langar/papers/TMC14.pdfAn Operations Research Game Approach for Resource and Power Allocation in Cooperative

Introductioni - PCCspot.pcc.edu/~lkidoguc/PAMBD/D_Y2_IyengarWay.pdfeldest daughter, Geeta, and son, Prashant, carry on the teaching tradition. Geeta Iyengar is much re-spectted as

Kangaroo Bond Issuance in Australia · Kangaroo Bond Issuance in Australia 1. Introductioni ... preferred currency and cash flow pattern of the issuer. In that sense the deep foreign

Introductioni-one-interiors.com/.../I-ONE-COMPANY-PROFILE-VER.2020.pdfAs a group company, I-One Interiors / Mazcot Carpentry & Decor pride themselves on excellence in delivering the

ON THE FUNCTIONALITY OF LANGUAGEjournals.linguisticsociety.org/elanguage/pip/article/...ON THE FUNCTIONALITY OF LANGUAGE Jan Nuyts O. INTRODUCTIONI Functionalism in the language sciences

Presentación de PowerPoint€¦ · I. IntroductionI. Introduction AssumptionsAssumptions The Social pharmacology Studies the “cycle of life” of any marketed drug product in the