48
From Indexing Data Structures to de Bruijn Graphs Bastien Cazaux, Thierry Lecroq, Eric Rivals LIRMM & IBC, Montpellier - LITIS Rouen June 15, 2014 Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 1 / 35

From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From Indexing Data Structures to de Bruijn Graphs

Bastien Cazaux, Thierry Lecroq, Eric Rivals

LIRMM & IBC, Montpellier - LITIS Rouen

June 15, 2014

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 1 / 35

Page 2: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Introduction

Motivation

De Bruijn Graph is largely used in de novo assembly. [Pevzner et al.,2001]

One builds a suffix tree before the assembly for some applications, forinstance for the error correction. [Salmela, 2010]

There exist algorithms to build directly the De Bruijn Graph.[Onodera et al., 2013] [Rødland, 2013]

None is able to build the Contracted De Bruijn Graph directly.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 2 / 35

Page 3: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Introduction

Indexing data structures

Numerous data structures: suffix tree, affix tree, suffix array, etc.

to index one or several texts (generalized index)

functionally equivalent to

compressed versions (CSA, FM-index, CST, etc)

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 3 / 35

Page 4: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Introduction

Relation between indexing structures and assembly graphs

Generalised Suffix Tree (GST)

indexes all factors of a set of texts

is functionally equivalent to other indexes (SA, CSA, FM-index)

Question: How to directly build, e.g., the assembly De Bruijn graph?

in classical or contracted form?

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 4 / 35

Page 5: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Introduction

Relation between indexing structures and assembly graphs

Generalised Suffix Tree (GST)

indexes all factors of a set of texts

is functionally equivalent to other indexes (SA, CSA, FM-index)

Question: How to directly build, e.g., the assembly De Bruijn graph?

in classical or contracted form?

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 4 / 35

Page 6: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

1 Generalized Suffix Tree (GST)

2 De Bruijn Graphs

3 From the Generalized Suffix Tree to the DBG

4 Contracted De Bruijn Graph

5 Conclusion and Future works

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 5 / 35

Page 7: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

Generalized Suffix Tree (GST)

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 5 / 35

Page 8: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

Example

S = {bbacbaa, cbaac , bacbab, cbabcaa, bcaacb}

||S || =∑

si∈S |si |

||S || = 7 + 5 + 6 + 7 + 6 = 31

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 6 / 35

Page 9: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

Suffix Tree

6

$

1

$acba

3

ca$

b

a

2

$acba

4

ca$

b

5

ca$

7

$

a1b2a3b4c5a6$7

6

1

acba

3

ca

b

a

2

acba

4

cab

5

ca

a1b2a3b4c5a6

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 7 / 35

Page 10: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

Feature1 take in input a set of words

2 no use an end marking special symboli.e. no hypothesis requiring that a suffix is not the prefix of anothersuffix

4

4

$

−→

Theorem

The GST of a set of words S takes linear space in ||S ||.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 8 / 35

Page 11: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Generalized Suffix Tree (GST)

Generalized Suffix Tree (GST)

7 7

6 6

3

3

bc

a

5

3

caab

4

4

3

a

2

ba

b

c

a

6 6

5

2

c

a

4

2

caab

2

a

1

b

cba

a

4

1

cb

caa

b

5

5

2

cb

aa

5

4

1

c

a

3

1

caab

ab

c

1

bacb

aa

S = {bbacbaa, cbaac , bacbab, cbabcaa, bcaacb}

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 9 / 35

Page 12: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

De Bruijn Graphs

De Bruijn Graphs

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 10 / 35

Page 13: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

De Bruijn Graphs

De Bruijn Graph

The assembly De Bruijn Graph (DBG+k )

Let k be a positive integer satisfying k ≥ 2 andS = {s1, . . . , sn} be a set of n words.The De Bruijn graph of order k of S , denoted by DBG+

k , is a graph suchthat

the nodes are the k-mers of S and

an arc links two k-mers of S if there exists an integer i such thatthese k-mers start at successive positions in si .

Remark: the arc definition implies that the two k-mers overlap by k − 1positions.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 11 / 35

Page 14: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

De Bruijn Graphs

De Bruijn Graph for assembly

bb cb

ba ac

aa ca

ab bc

S = {bbacbaa, cbaac , bacbab, cbabcaa, bcaacb}

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 12 / 35

Page 15: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

From the Generalized Suffix Tree to the DBG

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 13 / 35

Page 16: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

One defines 3 specific types of nodes in the GST (in VS):

|u| < k

|v| ≥ k

v initial node

|v| = k v initial exact node

|v| = k − 1 v subinitial node

Some properties

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 14 / 35

Page 17: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Generalized Suffix Tree (GST)

7 7

6 6

3

3

bc

a

5

3

caa

b

4

4

3

a

2

ba

b

c

a

6 6

5

2

c

a

4

2

caab

2

a

1

b

cba

a

4

1

cb

caa

b

5

5

2

cb

aa

5

4

1

c

a

3

1

caab

ab

c

1

bacb

aa

S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb} and k = 2

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 15 / 35

Page 18: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Generalized Suffix Tree (GST)

7 7

6 6

3

3

bc

a

5

3

caab

4

4

3

a

2

ba

b

c

a

6 6

5

2

c

a

4

2

caab

2

a

1

b

cba

a

4

1

cb

caa

b

5

5

2

cb

aa

5

4

1

c

a

3

1

caab

ab

c

1

bacb

aa

initial node

subinitial node

initial exact node

S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb} and k = 2

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 15 / 35

Page 19: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Nodes of the de Bruijn Graph

Notation: InitSLet InitS denote the set of initial nodes of the GST of S .

Property: node correspondence

The set of k-mers of DBG+k of S is isomorphic to InitS .

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 16 / 35

Page 20: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of the de Bruijn Graph

Idea1 Take an initial node v

2 follow its suffix link to node z (lose the first letter of its k-mer)

3 if needed, go the children of z to find its extensions

4 check whether the extensions are valid

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 17 / 35

Page 21: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Let v be an initial node, u its father, and z the node pointed at by thesuffix link of v .

Property 2

Let v be a node of suffix tree. If it exists, the suffix link of v belongs tothe sub-tree of the suffix link of par(v).

u

v

SL(u)

z = SL(v)

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 18 / 35

Page 22: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of DBG+k : case figure

v starting initial node: green

z := SL(v) node pointed at by its suffix link: orange

black straight arrows : arcs of ST

dotted arrows: suffix links

colored plain arrows: created arcs of the DBG+k

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 19 / 35

Page 23: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of DBG+k : Type 1

v is initial not exact &z is initial 6 6

b4

caa

1

cb

5

5

2

c

aa

cb

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 20 / 35

Page 24: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of DBG+k : Type 2

v is initial not exact &z is not initiali.e. z is deeper than worddepth k

6 6

5

2

c

a

4

2

caa

b

2

a

1

b

cbaa

b

1

bacb

aa

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 20 / 35

Page 25: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of DBG+k : Type 3

v is initial exactwith a single child

7 7

a

4

c

4b

a

3

a

2

b

5

5 5

2

c

aa

b

cb a

4

a

3

b

1

c

1

caa

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 20 / 35

Page 26: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

Arcs of DBG+k : Type 4

v is initial exactwith a several children

7 7

a

4c

4

ba

3

a

6 6

a

3

c

3

b

5

b

3

caa

2

b

6 6

b

a

5

a

cba

4

b

2

c

2

caa

2

a

1

b

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 20 / 35

Page 27: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

DBG construction

Theorem

Given the GST of a set of words S.The construction of the De Bruijn Graph takes linear time in ||S ||.

Proof

All four cases of the typology are processed in constant time.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 21 / 35

Page 28: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

From the Generalized Suffix Tree to the DBG

DBG construction

Theorem

Given the GST of a set of words S.The construction of the De Bruijn Graph takes linear time in ||S ||.

Proof

All four cases of the typology are processed in constant time.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 21 / 35

Page 29: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Contracted De Bruijn Graph

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 22 / 35

Page 30: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Example

not contracted

bba

bac acb cba bab

aac baa abc

caa bca

contracted

bbac acb cba

aac baa

babcaa

S = {bbacbaa, cbaac , bacbab, cbabcaa, bcaacb}

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 23 / 35

Page 31: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Left and right extensible nodes

Left or right extensible node

Let v be a node of the De Bruijn graph; we say that

v is left extensible if and only if d in(v) = 1.

v is right extensible if and only if dout(v) = 1.

u v

din(u) = 2

dout(u) = 1

din(v) = 1

dout(v) = 3

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 24 / 35

Page 32: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Contracted De Bruijn Graph

Contracted De Bruijn Graph

Let G := (VG ,EG ) be the De Bruijn graph of order k of S . We callK := (VK ,EK ) the Contracted De Bruijn Graph (CDBG+

k ) of order k ofS , the graph obtained from G by contracting iteratively the arcs (u, v)where u is right extensible and v is left extensible.

u v

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 25 / 35

Page 33: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Find right extensible nodes

Property 4: right extensible

Initial nodes of type 1, 2 or 3 are right-extensible.

Initial nodes of type 4 are not right-extensible.

Proof

Types 1, 2 or 3: word of node v has only one extension in S .

Types 4 nodes have several extensions in S .

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 26 / 35

Page 34: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Find right extensible nodes

Property 4: right extensible

Initial nodes of type 1, 2 or 3 are right-extensible.

Initial nodes of type 4 are not right-extensible.

Proof

Types 1, 2 or 3: word of node v has only one extension in S .

Types 4 nodes have several extensions in S .

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 26 / 35

Page 35: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Case of right extensible nodes ex. for k = 2

7 7

6 6

3

3

bc

a

5

3

caab

4

4

3

a

2

ba

b

c

a

6 6

5

2

c

a

4

2

caab

2

a

1

b

cba

a

4

1

cb

caa

b

5

5

2

cb

aa

5

4

1

c

a

3

1

caab

ab

c

1

bacb

aa

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 27 / 35

Page 36: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Find left extensible nodesLet v be a right extensible k-mer/node and (v , u) an arc of DBG+

k .Question: is u left extensible?

Property 5: left extensible

u is left extensible if and only if V suff(u)− V ter(u) = V suff(v)

6 6

u

5

2

c

a

4

2

caab

2

a

1

b

cba

a

b

1

bacb

aa V suff(1) = {1}

V ter(1) = {1}

V suff(u) = {5, 2, 4, 2, 2, 1}

V ter(u) = {1}

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 28 / 35

Page 37: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Find left extensible nodesLet v be a right extensible k-mer/node and (v , u) an arc of DBG+

k .Question: is u left extensible?

Property 5: left extensible

u is left extensible if and only if V suff(u)− V ter(u) = V suff(v)

Idea

determine how many nodes in sub-tree of u have Suffix Links pointingto them coming from the sub-tree of v

do not account for nodes representing the first suffix of words in Swhich we call terminal nodes

compare the cardinalities of non terminal suffixes in u sub-tree withsuffixes in v sub-tree.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 28 / 35

Page 38: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

u is left extensible

vu

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 29 / 35

Page 39: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

u is left extensible

vu

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 29 / 35

Page 40: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

u is left extensible

vu

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 29 / 35

Page 41: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

u is left extensible

vu

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 29 / 35

Page 42: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

u is not left extensible

vu

w

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 29 / 35

Page 43: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Case of left extensible nodes ex. for k = 2

7 7

6 6

3

3

bc

a

5

3

caab

4

4

3

a

2

ba

b

c

a

6 6

5

2

c

a

4

2

caab

2

a

1

b

cba

a

4

1

cb

caa

b

5

5

2

cb

aa

5

4

1

c

a

3

1

caab

ab

c

1

bacb

aa

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 30 / 35

Page 44: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Contracted De Bruijn Graph

Contracted DBG construction

Theorem

Given the GST of a set of words S.The construction of the Contracted DBG+

k takes linear time in ||S ||.

Idea of proof.

Detection of right extensible nodes is obtained from the 4 types.

To get left and right extensible nodes in constant time preprocess:

V suff(u) and V ter(u) for each node u of GST, and

for each node w of ST such that |w | ≥ ka pointer to its ancestral initial node.

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 31 / 35

Page 45: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Conclusion and Future works

Conclusion and Future works

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 32 / 35

Page 46: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Conclusion and Future works

Conclusion

A linear time algorithm that

builds the Contracted De Bruijn Graph

from a Generalized Suffix Tree or a Suffix Array

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 33 / 35

Page 47: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Conclusion and Future works

Future works

Use only a slice of the suffix tree

Update the order of the DBG: dynamically changing k

Go for compressed indexes instead of a Suffix Tree

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 34 / 35

Page 48: From Indexing Data Structures to de Bruijn Graphsstelo/cpm/cpm14/10_cazaux.pdfDe Bruijn Graphs De Bruijn Graph The assembly De Bruijn Graph (DBG+ k) Let k be a positive integer satisfying

Conclusion and Future works

Funding and acknowledgments

Thanks for your attention

Questions?

Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 35 / 35