A Linear Time Algorithm for Seeds Computation

Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

University of Warsaw

LSD 2012, London, February 9, 2012

Periodicity and quasiperiodicity

Periodicity:

[Figure: an example word over {a, b} with the factor "a a a b" marked beneath it.]

One of the key concepts in text algorithms.

Quasiperiodicity:

[Figure: an example quasiperiodic word over {a, b}.]

Covers and seeds

Cover:

[Figure: an example word over {a, b} with the occurrences of its cover marked.]

Each letter of the word is covered by an occurrence of the cover.
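The cover condition can be checked directly from this definition. The following is a minimal brute-force sketch, not the linear-time method cited later in the talk; the function names `occurrences` and `is_cover` are chosen here for illustration only.

```python
from typing import List


def occurrences(v: str, w: str) -> List[int]:
    """Start positions of all (possibly overlapping) occurrences of v in w."""
    return [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def is_cover(v: str, w: str) -> bool:
    """True if every letter of w lies inside some occurrence of v."""
    if not v:
        return False
    covered_up_to = 0  # letters w[0:covered_up_to] are covered so far
    for i in occurrences(v, w):
        if i > covered_up_to:
            return False  # letter w[covered_up_to] is left uncovered
        covered_up_to = i + len(v)
    return covered_up_to == len(w)
```

For instance, `is_cover("aba", "ababaaba")` holds because the occurrences at positions 0, 2 and 5 leave no letter uncovered, while `is_cover("ab", "aba")` fails on the last letter.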

Seed:

[Figure: an example word over {a, b} with the occurrences of its seed marked.]

Each letter of the word is covered by an occurrence of the seed. The occurrences can be external, i.e. they may overhang the ends of the word.
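A brute-force check of the seed condition may make the role of external occurrences concrete. The sketch below assumes the common convention that a seed must itself be a factor of w; this is not stated on the slide. It tries every placement of v, including those overhanging either end, and accepts a placement when v agrees with w on the overlapping part. The name `is_seed` is illustrative.

```python
def is_seed(v: str, w: str) -> bool:
    """Brute-force seed test: internal and overhanging placements of v must cover w."""
    n, k = len(w), len(v)
    # Assumption: a seed is required to be a factor of w.
    if k == 0 or k > n or v not in w:
        return False
    covered = [False] * n
    # Placement i puts v[0] at position i of w; negative i overhangs the left end,
    # i > n - k overhangs the right end.
    for i in range(-(k - 1), n):
        lo, hi = max(i, 0), min(i + k, n)
        if w[lo:hi] == v[lo - i:hi - i]:
            for p in range(lo, hi):
                covered[p] = True
    return all(covered)
```

Checking placements independently is enough here: on each side only the overhanging placement that reaches furthest into w matters, and the left and right overhangs never share a letter outside w. The running time is far from linear; the sketch only illustrates the definition.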

The main problem

Problem (Shortest-Seed)
Given a word w of length n over an alphabet Σ, compute the shortest seed of w.

Problem (All-Seeds)
Given a word w of length n over an alphabet Σ, compute an O(n)-sized representation of all the seeds of w.

Theorem (Our result)
The All-Seeds Problem for $\Sigma = \{0, 1, \ldots, n^{O(1)}\}$ can be solved in O(n) time.

Background

Seeds were introduced in 1993 by Iliopoulos, Moore & Park. The same paper gave an O(n log n)-time algorithm for the All-Seeds Problem over a fixed-size alphabet.

No o(n log n)-time algorithm has been known so far, even for the Shortest-Seed Problem over a binary alphabet, despite considerable effort devoted to this problem (Smyth, 2000).

Background (cont’d)

An O(log n)-time PRAM algorithm for n processors (Ben-Amram et al., 1994).

For covers, linear-time algorithms for similar problems are known:
  • shortest covers of each prefix (Breslauer, 1992)
  • all covers (Moore & Smyth, 1994)

Variants of seeds have been studied:
  • approximate seeds (Christodoulakis et al., 2003)
  • λ-seeds (Qing, Hui & Iliopoulos, 2006)

Constraints for seeds

Two different types of constraints:
  • border constraints (easier),
  • maxgap constraints (harder).

[Figure: an example word over {a, b} with two gaps between consecutive occurrences marked, each ≤ 5.]

The maxgap of a substring is the maximal distance between the starting positions of two consecutive occurrences of that substring.
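The maxgap of a single substring can be computed naively by listing its occurrences. The sketch below is for illustration only; the function name `maxgap` and the convention of returning 0 when there are fewer than two occurrences are assumptions, not taken from the talk.

```python
def maxgap(v: str, w: str) -> int:
    """Largest distance between start positions of consecutive occurrences of v in w."""
    occ = [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]
    if len(occ) < 2:
        return 0  # assumed convention: no pair of consecutive occurrences
    return max(b - a for a, b in zip(occ, occ[1:]))
```

With `w = "abaabab"` the substring `"ab"` starts at positions 0, 3 and 5, so its maxgap is 3. All letters between the first and last occurrence of v are covered exactly when the maxgap is at most |v|.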

Quasiseeds

The All-Seeds Problem can be linearly reduced to computing the maxgaps of all substrings (encoded in a suffix tree). No o(n log n)-time algorithm for the latter is known.

Definition (Quasiseed)
A substring v is a quasiseed of w if there are fewer than |v| letters both before its first occurrence and after its last one, and each letter between those two occurrences is covered by an occurrence of v.

[Figure: an example word over {a, b}; all letters between the first and the last occurrence are covered, and the uncovered margins at both ends have length < 5.]
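The quasiseed definition translates into three simple tests on the list of occurrences. This is a brute-force sketch with an illustrative function name, not the suffix-tree representation used in the talk.

```python
def is_quasiseed(v: str, w: str) -> bool:
    """Check the quasiseed definition directly on the occurrences of v in w."""
    k, n = len(v), len(w)
    if k == 0:
        return False
    occ = [i for i in range(n - k + 1) if w.startswith(v, i)]
    if not occ:
        return False
    if occ[0] >= k:                  # |v| or more letters before the first occurrence
        return False
    if n - (occ[-1] + k) >= k:       # |v| or more letters after the last occurrence
        return False
    # every letter between the first and last occurrence is covered
    return all(b - a <= k for a, b in zip(occ, occ[1:]))
```

For example, `"ab"` is a quasiseed of `"aab"` (one letter precedes its only occurrence), although it is not a seed of that word: no overhanging copy of `"ab"` can cover the leading `a`. This gap is what the border constraints account for.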

Useful properties of quasiseeds

An O(n) representation on the suffix tree.

[Figure: a fragment of the suffix tree with a node v; positions along the edges are labelled as quasiseeds or non-quasiseeds.]

A family of ranges, each represented by its lower end (an explicit node) and the depth of its upper end.

Useful properties of quasiseeds (cont’d)

Lemma (Restricted-Quasiseeds)
Given an integer d and a word w of length n, the representation of all quasiseeds of length in {d, d + 1, ..., 2d} can be found in O(n) time.

The All-Seeds Problem can be linearly reduced to computing (the representation of) all quasiseeds.

Main problem

Problem (All-Quasiseeds)
Given a word w of length n, compute the representation of all its quasiseeds.

Our solution is a recursive algorithm of a rather nonstandard structure.

Structure of the algorithm

Interval staircase (parametrized by an integer m).

[Figure: the word w with a staircase of intervals drawn above it; two lengths are marked ≥ 2m.]

Lemma (Short Quasiseeds)
A substring v of length ≤ m is a quasiseed of w if and only if it is a quasiseed of each substring corresponding to an interval of the staircase.

Structure of the algorithm (cont’d)

The total length of the intervals in a staircase (the size of the staircase) is at least n.

If it were, e.g., ≤ $\frac{1}{2}n$, the recursion could yield a linear algorithm.

We need to reduce the staircase.
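Why a staircase of size at most n/2 would be enough can be seen from a short recurrence. The derivation below is a sketch under two assumptions not spelled out on the slide: each call on a word of length n does at most cn non-recursive work for some constant c, and the words $n_i$ passed to its recursive calls have total length at most n/2.

```latex
\begin{align*}
T(n) &\le c\,n + \sum_i T(n_i), \qquad \sum_i n_i \le \tfrac{1}{2}\,n, \\
\text{non-recursive work at recursion depth } j &\le c \cdot \frac{n}{2^{j}}, \\
T(n) &\le \sum_{j \ge 0} c \cdot \frac{n}{2^{j}} \;=\; 2c\,n \;=\; O(n).
\end{align*}
```

The middle line follows by induction on the depth: the total length of all words handled at depth j is at most $n / 2^{j}$.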

[Figure: the staircase over w with some of its intervals marked "not needed".]

Structure of the algorithm (cont’d)

Outline:
  1. Find an appropriate reduced staircase.
  2. Find the long quasiseeds (non-recursively).
  3. Find the short quasiseeds (by recursive calls).
  4. Merge the results of those calls.

Main issue
How to find an appropriate staircase such that
  • the reduced staircase is small,
  • long quasiseeds can be found in O(n).

Due to the Restricted-Quasiseeds Lemma, m = Θ(n) would suffice for the second condition.
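One way to read the last remark: the long quasiseeds have lengths between m + 1 and n, and this interval can be split into ranges of the form {d, ..., 2d}, each handled in O(n) time by the lemma. The sketch below only illustrates that splitting; the function name `doubling_ranges` is hypothetical, and the talk does not say the ranges are chosen exactly this way.

```python
from typing import List, Tuple


def doubling_ranges(lo: int, hi: int) -> List[Tuple[int, int]]:
    """Split the integer interval [lo, hi] (lo >= 1) into ranges [d, min(2d, hi)]."""
    ranges = []
    d = lo
    while d <= hi:
        ranges.append((d, min(2 * d, hi)))
        d = 2 * d + 1  # the next range starts right after 2d
    return ranges
```

The number of ranges grows like log(hi / lo), so it is bounded by a constant when m = Θ(n); for instance, `doubling_ranges(5, 40)` yields three ranges.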

f-factorization

A variant of the well-known LZ-factorization.

Definition (f-factorization)
An f-factorization $f_1 f_2 \ldots f_k$ of w is constructed greedily: $f_i$ is either just the first occurrence of a letter or the longest prefix of the remaining suffix that is a substring of $f_1 \ldots f_{i-1}$.

[Figure: an example word over {a, b, c} split into the factors of its f-factorization.]

Theorem (Crochemore, 1983; Crochemore et al., 2009; Crochemore & Tischler, 2011)
The f-factorization over an integer alphabet can be computed in O(n) time.
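The greedy definition can be followed literally to obtain a slow reference implementation. The sketch below is a brute-force illustration, not one of the linear-time algorithms cited in the theorem; the function name is chosen here for illustration.

```python
from typing import List


def f_factorization(w: str) -> List[str]:
    """Greedy f-factorization: each factor is a new letter or the longest
    prefix of the remaining suffix that occurs in the part factorized so far."""
    factors = []
    pos = 0
    while pos < len(w):
        done = w[:pos]  # f_1 ... f_{i-1}
        length = 0
        # extend while the candidate factor still occurs inside the done part
        while pos + length < len(w) and w[pos:pos + length + 1] in done:
            length += 1
        if length == 0:
            length = 1  # first occurrence of a letter
        factors.append(w[pos:pos + length])
        pos += length
    return factors
```

On `"abaabab"` this produces the factors `a`, `b`, `a`, `aba`, `b`. Because each factor must occur entirely inside the already factorized prefix, `"aaaa"` is split into `a`, `a`, `aa`.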

f-staircase

Intuition:
  • Stairs lying within a factor are not needed.
  • In the reduced staircase, no stairs should overlap.

[Figure: the word w with a span of length > 4m and two segments of length 2m marked.]

Quasiseeds, staircase and factorization

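Lemma
Let F be the f-factorization of w (|w| = n) and let v be a quasiseed of w with $|v| < \frac{n}{16}$. Then at most $\frac{2n}{|v|} - 1$ factors from F lie within $\left[\frac{2n}{16}, \frac{15n}{16}\right]$.

[Figure: the word w with a prefix of length $\frac{2n}{16}$, a suffix of length $\frac{n}{16}$, and $g \le \frac{2n}{|v|} - 1$ middle factors between them.]

The size of the f-staircase is at most $\frac{3n}{16} + 4m(g + 1)$.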

Key lemmas

The algorithm does not know the quasiseed, but can find the number of middle factors.

Let g be the number of middle factors of the word w, |w| = n > 48.

Lemma
There is no quasiseed v of w such that $\frac{2n}{g+1} < |v| \le \frac{n}{16}$.

Lemma
If $m \le \frac{n}{16(g+1)}$ then the size of the f-staircase is $< \frac{n}{2}$.
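The second lemma follows by plugging the assumed bound on m into the size bound for the f-staircase stated on the previous slide, namely at most 3n/16 + 4m(g + 1). A short worked calculation:

```latex
\begin{align*}
\text{size of the } f\text{-staircase}
  &\le \frac{3n}{16} + 4m(g+1) \\
  &\le \frac{3n}{16} + 4 \cdot \frac{n}{16(g+1)} \cdot (g+1) \\
  &= \frac{3n}{16} + \frac{4n}{16} \;=\; \frac{7n}{16} \;<\; \frac{n}{2}.
\end{align*}
```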

Final structure of the algorithm

  1. Find an f-factorization and the number of middle factors (g).
  2. $m := \left\lfloor \frac{n}{16(g+1)} \right\rfloor$
  3. Compute the f-staircase.
  4. Compute the long quasiseeds (belonging to two ranges of fixed ratio).
  5. If m > 0, compute the short quasiseeds by recursive calls and merge the results.

Merging needs some more attention.

Representation of quasiseeds

An O(n) representation on the suffix tree.

[Figure: a fragment of the suffix tree with a node v; positions along the edges are labelled as quasiseeds or non-quasiseeds.]

A family of ranges, each represented by its lower end (an explicit node) and the depth of its upper end.

Merging

We have to find the intersection of the sets of quasiseeds of some substrings of w.

  • The quasiseeds of each substring are reported as a family of ranges on the suffix tree of this substring.
  • We want the result to be on the suffix tree of the whole w.
  • There is no direct relation between (explicit) nodes of those suffix trees and the suffix tree of w.
  • But the relation between leaves is straightforward.
  • Solution: represent a range by a leaf with a pair of depths.

Merging cont’d

The (leaf, depth, depth) representation is good for transferring only.

We need to go back to the original one. It can be done with the aid of the following queries:

Ancestor Queries
Given an integer h and a leaf v, find the highest explicit ancestor of v whose depth is ≥ h.

Any q such queries can be answered offline in O(n + q) time.

Once we’re back in the original representation, finding the intersection is elementary.
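
A sketch of answering ancestor queries offline. This is not the O(n + q) method from the talk, which the slides do not describe. It is a simpler substitute: a DFS that keeps the current root-to-node path and binary-searches it, running in O(n + q log n) time. The tree encoding (`children` and `depth` dictionaries) and the function name are assumptions made for illustration. A node counts as its own ancestor here.

```python
from bisect import bisect_left
from collections import defaultdict


def answer_ancestor_queries(children, depth, root, queries):
    """For each query (v, h) return the highest ancestor of v with depth >= h.

    children: dict node -> list of child nodes (explicit nodes only)
    depth:    dict node -> depth, strictly increasing along root-to-leaf paths
    queries:  list of (v, h) pairs; the answer is None if no ancestor qualifies
    """
    by_node = defaultdict(list)
    for idx, (v, h) in enumerate(queries):
        by_node[v].append((idx, h))

    answers = [None] * len(queries)
    path_nodes, path_depths = [], []      # current root-to-node path
    stack = [(root, True)]                # (node, entering?)
    while stack:
        node, entering = stack.pop()
        if not entering:                  # leaving the node: shrink the path
            path_nodes.pop()
            path_depths.pop()
            continue
        path_nodes.append(node)
        path_depths.append(depth[node])
        for idx, h in by_node.get(node, ()):
            # Depths on the path are sorted, so the first entry >= h
            # belongs to the highest qualifying ancestor.
            k = bisect_left(path_depths, h)
            if k < len(path_nodes):
                answers[idx] = path_nodes[k]
        stack.append((node, False))
        for c in children.get(node, ()):
            stack.append((c, True))
    return answers


# Small example tree: 0 is the root, 2, 3 and 4 are leaves.
children = {0: [1, 2], 1: [3, 4]}
depth = {0: 0, 1: 2, 2: 4, 3: 5, 4: 3}
queries = [(3, 1), (3, 3), (4, 2), (2, 0), (2, 5)]
result = answer_ancestor_queries(children, depth, 0, queries)
```

With the answering node in hand, a (leaf, depth, depth) triple can be rewritten as a range on the edge entering that node.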

Conclusions

We have presented a linear algorithm for the All-Quasiseeds Problem (over integer alphabet).

This yields a linear algorithm for the All-Seeds Problem (over integer alphabet).

Thank you for your attention!
