From Indexing Data Structures to de Bruijn Graphs
Bastien Cazaux, Thierry Lecroq, Eric Rivals
LIRMM & IBC, Montpellier - LITIS Rouen
June 15, 2014
Cazaux, Lecroq, Rivals (LIRMM) Generalized Suffix Tree & DBG June 15, 2014 1 / 35
Introduction
Motivation
The De Bruijn Graph is widely used in de novo assembly. [Pevzner et al., 2001]
For some applications, for instance error correction, one builds a suffix tree before the assembly. [Salmela, 2010]
There exist algorithms that build the De Bruijn Graph directly. [Onodera et al., 2013] [Rødland, 2013]
None is able to build the Contracted De Bruijn Graph directly.
Indexing data structures
Numerous data structures: suffix tree, affix tree, suffix array, etc.
to index one or several texts (generalized index)
functionally equivalent to
compressed versions (CSA, FM-index, CST, etc)
Relation between indexing structures and assembly graphs
Generalised Suffix Tree (GST)
indexes all factors of a set of texts
is functionally equivalent to other indexes (SA, CSA, FM-index)
Question: How to directly build, e.g., the assembly De Bruijn graph?
in classical or contracted form?
1 Generalized Suffix Tree (GST)
2 De Bruijn Graphs
3 From the Generalized Suffix Tree to the DBG
4 Contracted De Bruijn Graph
5 Conclusion and Future works
Generalized Suffix Tree (GST)
Example
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}
$\|S\| = \sum_{s_i \in S} |s_i|$
$\|S\| = 7 + 5 + 6 + 7 + 6 = 31$
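The arithmetic above can be checked with a minimal Python sketch. The name `total_length` is an illustrative assumption, not something defined in the slides.

```python
# Example set of words from the slide.
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def total_length(words):
    """Return ||S||, the sum of the lengths of the words."""
    return sum(len(w) for w in words)

print(total_length(S))  # 7 + 5 + 6 + 7 + 6 = 31
```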
Suffix Tree
[Figure: suffix tree of the word ababca, shown with the end marker $ (positions a1 b2 a3 b4 c5 a6 $7) and without it (positions a1 b2 a3 b4 c5 a6).]
Features
1 takes a set of words as input
2 uses no special end-marking symbol, i.e. no hypothesis requiring that a suffix is not the prefix of another suffix
[Figure: a tree fragment with the end marker $ and, after the arrow, without it.]
Theorem
The GST of a set of words S takes linear space in ||S ||.
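To illustrate feature 2, here is a naive sketch that assumes an uncompacted trie of all suffixes in place of the real GST. It is quadratic in space, so it does not achieve the linear bound of the theorem; it only shows that, without an end marker, a suffix may end at an internal node. The names `build_suffix_trie` and `find` and the dictionary layout are illustrative assumptions.

```python
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def build_suffix_trie(words):
    """Naive trie of all suffixes of all words (illustration only, not linear space)."""
    root = {"children": {}, "ends": []}
    for i, w in enumerate(words):
        for start in range(len(w)):
            node = root
            for ch in w[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "ends": []})
            # No end marker: record which suffix (word index, start) ends at this node.
            node["ends"].append((i, start))
    return root

def find(root, word):
    """Return the trie node spelling `word`, or None if it is not a factor."""
    node = root
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return None
    return node

trie = build_suffix_trie(S)
node = find(trie, "aa")
# "aa" is a suffix of two words and also a proper prefix of the suffix "aac",
# so the suffix ends at an internal node.
print(node["ends"], sorted(node["children"]))
```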
Generalized Suffix Tree (GST)
[Figure: generalized suffix tree of the example set S.]
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}
De Bruijn Graphs
De Bruijn Graph
The assembly De Bruijn Graph ($DBG^+_k$)
Let k be a positive integer satisfying k ≥ 2 and $S = \{s_1, \ldots, s_n\}$ be a set of n words. The De Bruijn graph of order k of S, denoted by $DBG^+_k$, is a graph such that
the nodes are the k-mers of S and
an arc links two k-mers of S if there exists an integer i such that these k-mers start at successive positions in $s_i$.
Remark: the arc definition implies that the two k-mers overlap by k − 1 positions.
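The definition above can be applied directly by scanning each word. The following is a minimal sketch under that reading; the function name `de_bruijn_graph` and the representation of the graph as two Python sets are assumptions made for illustration. It is not the suffix-tree construction discussed in the talk.

```python
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def de_bruijn_graph(words, k):
    """Return (nodes, arcs) of the assembly De Bruijn graph of order k."""
    nodes, arcs = set(), set()
    for w in words:
        kmers = [w[i:i + k] for i in range(len(w) - k + 1)]
        nodes.update(kmers)
        # k-mers starting at successive positions of the same word are linked.
        arcs.update(zip(kmers, kmers[1:]))
    return nodes, arcs

nodes, arcs = de_bruijn_graph(S, 2)
print(sorted(nodes))
# Every arc (u, v) overlaps on k - 1 letters, as stated in the remark.
print(all(u[1:] == v[:-1] for u, v in arcs))
```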
De Bruijn Graph for assembly
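[Figure: De Bruijn graph of S on the 2-mers bb, cb, ba, ac, aa, ca, ab, bc; the arcs are not recoverable from the extraction.]
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}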
From the Generalized Suffix Tree to the DBG
One defines 3 specific types of nodes in the GST (in $V_S$):
initial node: a node v with |v| ≥ k whose father u satisfies |u| < k
initial exact node: an initial node v with |v| = k
subinitial node: a node v with |v| = k − 1
Some properties
Generalized Suffix Tree (GST)
[Figure: GST of S with the initial nodes, subinitial nodes and initial exact nodes marked.]
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb} and k = 2
Nodes of the de Bruijn Graph
Notation: $Init_S$
Let $Init_S$ denote the set of initial nodes of the GST of S.
Property: node correspondence
The set of k-mers of $DBG^+_k$ of S is in one-to-one correspondence with $Init_S$.
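The correspondence can be observed on the running example with a rough sketch. It uses a naive suffix trie as a stand-in for the GST and assumes that the explicit GST nodes are the branching nodes, the leaves, and the nodes where a suffix ends (an assumption on our part, since the slides do not spell this out). Each k-mer is mapped to the word of the first explicit node of depth at least k below it. All names are hypothetical.

```python
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def suffix_trie(words):
    """Naive trie of all suffixes; 'end' marks nodes where some suffix ends."""
    root = {"kids": {}, "end": False}
    for w in words:
        for start in range(len(w)):
            node = root
            for ch in w[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "end": False})
            node["end"] = True
    return root

def is_explicit(node):
    # Assumption: explicit nodes are suffix ends, leaves and branching nodes.
    return node["end"] or len(node["kids"]) != 1

def initial_nodes(words, k):
    """Map each k-mer to the word spelled by its initial node."""
    result = {}
    stack = [(suffix_trie(words), "")]
    while stack:
        node, word = stack.pop()
        if len(word) < k:
            for ch, child in node["kids"].items():
                stack.append((child, word + ch))
            continue
        kmer = word
        # Slide down the non-branching path to the first explicit node.
        while not is_explicit(node):
            ch, node = next(iter(node["kids"].items()))
            word += ch
        result[kmer] = word
    return result

init = initial_nodes(S, 2)
exact = sorted(kmer for kmer, word in init.items() if len(word) == 2)
print(len(init), exact)  # one initial node per 2-mer; exact ones have depth k
```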
Arcs of the de Bruijn Graph
Idea
1 take an initial node v
2 follow its suffix link to node z (lose the first letter of its k-mer)
3 if needed, go to the children of z to find its extensions
4 check whether the extensions are valid
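The four steps above can be mimicked on plain sets of factors. This is only a sketch of the logic: set lookups replace the suffix-tree navigation, so it says nothing about the linear running time. The helper names `factors` and `arcs_via_suffix_links` are assumptions, and "valid" is read here as "the k-mer of v followed by the extension letter occurs in S".

```python
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def factors(words, length):
    """All factors of the given length occurring in the words."""
    return {w[i:i + length] for w in words for i in range(len(w) - length + 1)}

def arcs_via_suffix_links(words, k):
    kmers = factors(words, k)
    longer = factors(words, k + 1)
    alphabet = sorted(set("".join(words)))
    arcs = set()
    for v in kmers:                      # step 1: take a k-mer (initial node)
        z = v[1:]                        # step 2: suffix link loses the first letter
        for c in alphabet:               # step 3: candidate extensions of z
            # step 4: the extension is valid only if v itself is followed by c
            if z + c in kmers and v + c in longer:
                arcs.add((v, z + c))
    return arcs

print(sorted(arcs_via_suffix_links(S, 2)))
```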
Let v be an initial node, u its father, and z the node pointed at by the suffix link of v.
Property 2
Let v be a node of the suffix tree. If it exists, the suffix link of v belongs to the sub-tree of the suffix link of par(v).
[Figure: nodes u and v, with suffix links to SL(u) and z = SL(v).]
Arcs of $DBG^+_k$: case figure
v starting initial node: green
z := SL(v) node pointed at by its suffix link: orange
black straight arrows: arcs of the ST
dotted arrows: suffix links
colored plain arrows: created arcs of the $DBG^+_k$
Arcs of $DBG^+_k$: Type 1
v is initial, not exact, and z is initial
[Figure: Type 1 case on a fragment of the GST of S.]
Arcs of $DBG^+_k$: Type 2
v is initial, not exact, and z is not initial, i.e. z is deeper than word depth k
[Figure: Type 2 case on a fragment of the GST of S.]
Arcs of $DBG^+_k$: Type 3
v is initial exact with a single child
[Figure: Type 3 case on a fragment of the GST of S.]
Arcs of $DBG^+_k$: Type 4
v is initial exact with several children
[Figure: Type 4 case on a fragment of the GST of S.]
DBG construction
Theorem
Given the GST of a set of words S, the construction of the De Bruijn Graph takes linear time in ||S||.
Proof
All four cases of the typology are processed in constant time.
Contracted De Bruijn Graph
Example
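not contracted (nodes): bba, bac, acb, cba, bab, aac, baa, abc, caa, bca
contracted (nodes): bbac, acb, cba, aac, baa, babcaa
[Figure: the two graphs; the arcs are not recoverable from the extraction.]
S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}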
Left and right extensible nodes
Left or right extensible node
Let v be a node of the De Bruijn graph; we say that
v is left extensible if and only if $d_{in}(v) = 1$.
v is right extensible if and only if $d_{out}(v) = 1$.
[Figure: two nodes u and v with $d_{in}(u) = 2$, $d_{out}(u) = 1$, $d_{in}(v) = 1$, $d_{out}(v) = 3$.]
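Both notions only depend on the in- and out-degrees, so they can be computed directly from the arcs. The sketch below does this by scanning the words; the function name `extensible` and the returned pair of sets are illustrative assumptions.

```python
from collections import defaultdict

S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def extensible(words, k):
    """Return (left_extensible, right_extensible) sets of k-mers."""
    succ, pred = defaultdict(set), defaultdict(set)
    for w in words:
        for i in range(len(w) - k):
            u, v = w[i:i + k], w[i + 1:i + k + 1]
            succ[u].add(v)   # out-neighbours of u
            pred[v].add(u)   # in-neighbours of v
    left = {v for v, ps in pred.items() if len(ps) == 1}    # d_in(v) = 1
    right = {u for u, ss in succ.items() if len(ss) == 1}   # d_out(u) = 1
    return left, right

left, right = extensible(S, 2)
print(sorted(left), sorted(right))
```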
Contracted De Bruijn Graph
Let $G := (V_G, E_G)$ be the De Bruijn graph of order k of S. We call $K := (V_K, E_K)$ the Contracted De Bruijn Graph ($CDBG^+_k$) of order k of S, the graph obtained from G by iteratively contracting the arcs (u, v) where u is right extensible and v is left extensible.
[Figure: the arc (u, v) is contracted into a single node w.]
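A direct, degree-based reading of this definition can be sketched as follows: since contracting an arc (u, v) leaves the degrees of all other nodes unchanged, iterated contraction amounts to merging maximal chains of contractible arcs. This is a naive illustration and not the GST-based algorithm of the talk; `contracted_labels` and its helpers are hypothetical names, and only node labels are returned.

```python
from collections import defaultdict

def contracted_labels(words, k):
    """Labels of the nodes obtained by contracting every arc (u, v)
    with d_out(u) = 1 and d_in(v) = 1."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for w in words:
        for i in range(len(w) - k + 1):
            nodes.add(w[i:i + k])
            if i + k < len(w):
                succ[w[i:i + k]].add(w[i + 1:i + k + 1])
                pred[w[i + 1:i + k + 1]].add(w[i:i + k])

    def contractible(u, v):
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def chain_from(start, seen):
        # Follow contractible arcs, appending one letter per merged node.
        label, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if nxt in seen or not contractible(cur, nxt):
                break
            label += nxt[-1]
            seen.add(nxt)
            cur = nxt
        return label

    seen, labels = set(), []
    # A chain starts at a node whose incoming arc (if any) is not contractible.
    for v in sorted(nodes):
        preds = pred[v]
        if len(preds) == 1 and contractible(next(iter(preds)), v):
            continue
        labels.append(chain_from(v, seen))
    # Nodes left over lie on isolated cycles made only of contractible arcs.
    for v in sorted(nodes):
        if v not in seen:
            labels.append(chain_from(v, seen))
    return sorted(labels)

print(contracted_labels(["abcde"], 2))         # a single path collapses to one node
print(contracted_labels(["abcd", "xbcy"], 2))  # bc has two in- and out-arcs: nothing merges
```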
Find right extensible nodes
Property 4: right extensible
Initial nodes of type 1, 2 or 3 are right-extensible.
Initial nodes of type 4 are not right-extensible.
Proof
Types 1, 2 or 3: the word of node v has only one extension in S.
Type 4: the word of node v has several extensions in S.
Case of right extensible nodes, example for k = 2
[Figure: GST of S illustrating the right extensible nodes.]
Find left extensible nodes
Let v be a right extensible k-mer/node and (v, u) an arc of $DBG^+_k$.
Question: is u left extensible?
Property 5: left extensible
u is left extensible if and only if $V_{suff}(u) - V_{ter}(u) = V_{suff}(v)$
[Figure: fragment of the GST of S, annotated with V suff(1) = {1}, V ter(1) = {1}, V suff(u) = {5, 2, 4, 2, 2, 1}, V ter(u) = {1}.]
Idea
determine how many nodes in the sub-tree of u have suffix links pointing to them from the sub-tree of v
do not count the nodes representing the first suffix of words in S, which we call terminal nodes
compare the cardinality of non-terminal suffixes in the sub-tree of u with that of suffixes in the sub-tree of v.
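The counting idea can be imitated on occurrences in the words rather than on GST nodes. The sketch below is a counting analogue of the degree condition d_in(u) = 1, not a transcription of Property 5, because the text does not fully specify the sets V suff and V ter. It compares the occurrences of u that are not at the start of a word (the "non-terminal" ones) with the occurrences of v that are followed by the last letter of u. Function names are hypothetical.

```python
S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]

def count_occurrences(words, factor, skip_word_start=False):
    """Count occurrences of `factor`, optionally ignoring those at position 0."""
    first = 1 if skip_word_start else 0
    return sum(1 for w in words
               for i in range(first, len(w) - len(factor) + 1)
               if w.startswith(factor, i))

def is_left_extensible(words, v, u):
    """Given an arc (v, u), decide by counting whether v is the only predecessor of u."""
    # Occurrences of u that have a preceding letter.
    preceded = count_occurrences(words, u, skip_word_start=True)
    # Occurrences of v immediately followed by the last letter of u.
    via_v = count_occurrences(words, v + u[-1])
    return preceded == via_v

print(is_left_extensible(S, "ac", "cb"))  # every non-initial cb is preceded by ac
print(is_left_extensible(S, "bb", "ba"))  # ba is also reached from cb
```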
u is left extensible
[Figure: nodes v, u and w, shown step by step for the case where u is left extensible.]
u is not left extensible
[Figure: nodes v, u and w for the case where u is not left extensible.]
Case of left extensible nodes, example for k = 2
[Figure: GST of S illustrating the left extensible nodes.]
Contracted DBG construction
Theorem
Given the GST of a set of words S, the construction of the Contracted $DBG^+_k$ takes linear time in ||S||.
Idea of proof.
Detection of right extensible nodes is obtained from the 4 types.
To get left and right extensible nodes in constant time, preprocess:
$V_{suff}(u)$ and $V_{ter}(u)$ for each node u of the GST, and
for each node w of the ST such that |w| ≥ k, a pointer to its ancestral initial node.
Conclusion and Future works
Conclusion
A linear time algorithm that
builds the Contracted De Bruijn Graph
from a Generalized Suffix Tree or a Suffix Array
Future works
Use only a slice of the suffix tree
Update the order of the DBG: dynamically changing k
Go for compressed indexes instead of a Suffix Tree
Funding and acknowledgments
Thanks for your attention
Questions?