Indexing Biological Sequence Data


Doctoral Seminar by

Mihail R. Halachev

Supervisor: Dr. N. Shiri

Dept. of Computer Science and Software Engineering
Concordia University

11/29/2004

Outline

- Introduction: From DNA to sequence data
- Basic tasks over biological sequence data
- Search techniques
- Indexing techniques for sequence data
- Applicability to bioinformatics
- Suffix Trees
- Conclusion
- Future Work

Source: National Health Museum

From DNA to sequence data representation

The 2 strands are complementary: A pairs with T, and C pairs with G.

A DNA segment can be encoded using the bases from only one of the strands:

S = AGTACG

Σ = {A, C, G, T}
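Because the strands are complementary, the second strand can be derived from the stored one. The following is a minimal sketch of that idea; the function names are illustrative and are not part of the seminar material.

```python
# Map each base to its complementary base (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]

S = "AGTACG"
print(complement(S))          # TCATGC
print(reverse_complement(S))  # CGTACT
```

Storing only one strand therefore loses no information: the other strand is a cheap, deterministic function of it.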

Source: Wikipedia

From mRNA to sequence data representation

Each codon specifies a single amino acid.

S = ATGLRS*

|Σ’| = 20


Basic tasks over biological data

From a biological point of view:

- Given a novel DNA sequence, search the primary biological DBs for similar (already known) sequences. Similarity (alignment), homology.
- Compare a novel protein sequence to secondary protein DBs containing motifs, signatures, protein domains, etc. Approximation of the biochemical function of the query protein.

From a computational point of view:

- both tasks are essentially searching


Search techniques for biological sequence data (BLAST, Clustal W)

Basic Local Alignment Search Tool (BLAST) [Altschul ‘90, ‘97]

The NCBI BLAST family of programs includes:

blastp - an amino acid query against a protein DB

blastn - a nucleotide query against a nucleotide DB

blastx - a nucleotide query (in all reading frames) against a protein DB

tblastn - a protein query against a nucleotide DB (in all reading frames)

tblastx - the six-frame translations of a nucleotide query against the six-frame translations of a nucleotide DB

How BLAST works

Local pairwise alignment

• The BLAST algorithm is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix.
• Words in the database that score T or greater are extended in both directions in an attempt to find an alignment that produces an HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold.
• T parameter values: a trade-off between speed and sensitivity of the search.
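The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the NCBI implementation: seeds are exact word matches of length W (instead of neighborhood words scoring at least T under a substitution matrix), extension is ungapped with assumed match/mismatch scores of +1/-1, and the `drop` cutoff is an illustrative parameter. All positions are 0-based.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w query word occurs in the subject."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

def extend(query, subject, qi, si, w, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions; stop once the score falls `drop` below the best."""
    best = w * match
    # Extend to the right of the seed.
    score, right, k = best, 0, 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= drop:
            break
    # Extend to the left of the seed, starting from the best score so far.
    score, left, k = best, 0, 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= drop:
            break
    return best, qi - left, si - left, w + left + right

def hsps(query, subject, w=4, s=6):
    """Return distinct (score, query_start, subject_start, length) pairs scoring at least s."""
    found = set()
    for qi, si in find_seeds(query, subject, w):
        hit = extend(query, subject, qi, si, w)
        if hit[0] >= s:
            found.add(hit)
    return sorted(found)

print(hsps("ACGTTGCA", "TTACGTTGCATT"))  # [(8, 0, 2, 8)]
```

Raising W or S reduces the number of seeds and reported pairs, which mirrors the speed versus sensitivity trade-off noted for T above.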

Source: National Center for Biotech Info

BLAST Case Study [Hunt ‘01]

Hardware: SUN Enterprise 450, 2 GB RAM, 4 Processors, Solaris 7

Software: BLAST (with default parameter settings)

Data: 3 human chromosomes (294 Mbp, 10% of human genome), data on local disks

Queries: 99 query sequences (predicted human genes), with lengths between 429 and 5999 bp

Results: 6559 hits, average 66 hits per query.

Time: 62 hours

BLAST Observations

“BLAST:
- performs serial scan of the DB;
- is CPU intensive;
- its usefulness depends on the biologists being able to provide appropriate search parameters values.”

[Hunt ‘01]

“Filtering approaches, like BLAST, are only suitable for high similarity matching, but often low similarities are biologically significant.”

[Navarro ‘00a]

Clustal W [Thompson ‘94]

Dynamic Programming alignment method

Based on global multiple alignment

Input: set of N sequences

Output: a global alignment of the N sequences (built progressively, so it is not guaranteed to be optimal)

Improved sensitivity (may find similar sequences which BLAST may omit)

50-100 times slower than BLAST

Motivation for Indexing

“Many of these biological datasets are growing at exponential rates – for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months.”

[Tata ‘04]

“As there is a rapid rise in both the volume of data and the demand for searches by researchers investigating functional genomics, it is worth investigating the possibility of accelerating these searches using indexes.”

[Hunt ‘01]


Indexing Techniques for Sequence Data

Q-grams [Navarro ‘98]

String B-Tree [Ferragina ‘99]

Multi-D Index [Jagadish ‘00]

Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Q-grams -- Construction

Input: T is a text over Σ, |T| = n, |Σ| = σ

Pick an integer q, say q = 4 (0 < q < n; a good heuristic is q ≈ log_σ n)

Each substring of T with size q is called a “q-gram” and is stored in the index table (in lexical order) with a list of pointers to positions (or blocks) in T where this q-gram occurs
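A minimal sketch of this construction is shown below. The sample text and the function name are illustrative assumptions; positions are 1-based and the table is kept in lexical order, as described above.

```python
from collections import defaultdict

def build_qgram_index(text: str, q: int) -> dict:
    """Map every length-q substring of `text` to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i + 1)
    # Return the table with its keys in lexical order.
    return dict(sorted(index.items()))

index = build_qgram_index("AGTACGTACG", 3)
for qgram, positions in index.items():
    print(qgram, positions)
# ACG [4, 8]
# AGT [1]
# CGT [5]
# GTA [2, 6]
# TAC [3, 7]
```

The table has at most min(σ^q, n - q + 1) entries, which is why q is kept small relative to n.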

Q-grams -- Searching

For a pattern P, |P| = m, find all approximate occurrences P’ of P in T, where the error ratio of each P’ is ≤ λ.

λ = k / m, where k is the edit distance of P’ to P

- Knowing m and the desired λ, compute k.
- Split P into k + 1 disjoint pieces. At most k errors cannot touch all k + 1 pieces, so at least one piece must occur unchanged in every approximate occurrence.
- For each of the k + 1 pieces, search the index table (binary search).
- The set of candidate matches is the union of all occurrences.
- Verify each candidate by neighborhood search.

Q-grams -- Example

T = combatcarontariocontractcan (positions 1 to 27)

Set q = 3.

Index Table:

q-gram  T[pos]
act     22
an      26
ari     13
aro     8
atc     5
bat     4
can     25
car     7
com     1
con     17
ctc     23
ioc     15
mba     3
n       27
nta     11
ntr     19
oco     16
omb     2
ont     10, 18
rac     21
rio     14
ron     9
tar     12
tca     6, 24
tra     20

Q-grams -- Example

Search for P = con, k = 1 (i.e. allow only one error); split P into k + 1 pieces: P1 = c and P2 = on

• Candidate Matches
P1 = c : 25, 7, 1, 17, 23
P2 = on : 10, 18

• Verification (1 error allowed)
con ? bat
con ? can
con ? car
con ? com
con ? con
con ? ctc
con ? ioc
con ? omb
con ? ont
con ? tar

Answer: T[25], T[1], T[17]
+ T[9], T[17]
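The search in this example can be sketched as follows. This is a simplified illustration under stated assumptions: the verification step only counts substitutions (Hamming distance) over a window of length m, rather than running a full edit-distance neighborhood search, each piece is assumed to be no longer than q, and all function names are hypothetical. The index keeps the truncated q-grams at the end of T, as the index table of the example does.

```python
from bisect import bisect_left

def build_index(text, q):
    """q-gram table with 1-based positions; trailing truncated q-grams are kept."""
    index = {}
    for i in range(len(text)):
        index.setdefault(text[i:i + q], []).append(i + 1)
    return index

def split_pattern(pattern, k):
    """Split the pattern into k+1 disjoint pieces; return (offset, piece) pairs."""
    base, extra = divmod(len(pattern), k + 1)
    pieces, start = [], 0
    for j in range(k + 1):
        size = base + (1 if j >= k + 1 - extra else 0)  # later pieces absorb the remainder
        pieces.append((start, pattern[start:start + size]))
        start += size
    return pieces

def prefix_lookup(keys, index, piece):
    """Binary search the sorted q-grams for those starting with `piece`."""
    hits = []
    i = bisect_left(keys, piece)
    while i < len(keys) and keys[i].startswith(piece):
        hits.extend(index[keys[i]])
        i += 1
    return hits

def search(text, pattern, k, q=3):
    """Return 1-based start positions where the pattern matches with at most k substitutions."""
    index = build_index(text, q)
    keys = sorted(index)
    m = len(pattern)
    candidates = set()
    for offset, piece in split_pattern(pattern, k):
        for pos in prefix_lookup(keys, index, piece):
            candidates.add(pos - offset)  # where the whole pattern would start
    matches = []
    for start in sorted(candidates):
        if start < 1 or start + m - 1 > len(text):
            continue
        window = text[start - 1:start - 1 + m]
        if sum(a != b for a, b in zip(pattern, window)) <= k:
            matches.append(start)
    return matches

print(search("combatcarontariocontractcan", "con", 1))  # [1, 9, 17, 25]
```

The result agrees with the answer of the example: T[1] (com), T[9] (ron), T[17] (con) and T[25] (can).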


String B-Tree -- Construction

Input: a set of words, S = {aid, atom, attenuate, car, patent, zoo, atlas}

Step 1. Store S consecutively on disk.

Step 2. Sort lexicographically each suffix of each word.

Lexicographic Order
“aid” : S[1]
“ar” : S[21]
“as” : S[38]
“ate” : S[16]
…..
“uate” : S[15]
“zoo” : S[31]

Step 3. Create leaf nodes. Each node contains pointers to the sorted suffixes.

1 21 38 16 25 35 5 10 20 3 18 27 13 2 37 8 28 14 33 7 32 24 22 39 29 17 26 12 36 6 11 15 31

Step 4. Propagate LMP and RMP (the leftmost and rightmost pointers) from each node up, until the root is constructed.

1 16 25 10 20 27 13 8 28 7 32 39 29 12 36 31

1 10 20 8 28 39 29 31

1 8 28 31
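The four steps can be reproduced with a short in-memory sketch. Several details are assumptions inferred from the pointer lists above rather than stated in the seminar: words are separated by one character (a hypothetical "#"), each node holds 4 pointers, and a short final node is merged into its left neighbor. A real String B-Tree keeps these nodes in disk pages.

```python
def sorted_suffix_pointers(words, sep="#"):
    """Steps 1-2: lay the words out in one string and sort the 1-based suffix pointers."""
    text = sep.join(words)

    def suffix(p):
        end = text.find(sep, p - 1)
        return text[p - 1:] if end == -1 else text[p - 1:end]

    pointers = [i + 1 for i, ch in enumerate(text) if ch != sep]
    return sorted(pointers, key=suffix)

def chunk(items, b):
    """Group items into nodes of b entries; fold a short last node into the previous one."""
    nodes = [items[i:i + b] for i in range(0, len(items), b)]
    if len(nodes) > 1 and len(nodes[-1]) < b // 2:
        last = nodes.pop()
        nodes[-1].extend(last)
    return nodes

def build_levels(leaf_pointers, b=4):
    """Steps 3-4: build the leaves, then propagate leftmost/rightmost pointers up to the root."""
    levels = [chunk(leaf_pointers, b)]
    while len(levels[-1]) > 1:
        ends = [p for node in levels[-1] for p in (node[0], node[-1])]
        levels.append(chunk(ends, b))
    return levels

words = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]
levels = build_levels(sorted_suffix_pointers(words))
for level in levels:
    print(level)
print(levels[-1])  # [[1, 8, 28, 31]]
```

Each printed level corresponds to one of the pointer rows listed in Steps 3 and 4, ending with the root 1 8 28 31.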

String B-Tree -- Construction

Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.

Each node is implemented as a modified Patricia Trie.

[Figure: Patricia Trie for the node with pointers 1 16 25 10, whose keys are aid, ate, atent, attenuate]
String B-Tree -- Searching

Find all occurrences of P = te in S

Start at root (1 8 28 31): t > n and t < z, branch right

Child node (28 39 29 31): t ≥ t and t < z, branch right

Child node (29 12 36 31): te ≥ te and te < tl, branch left

Leaf node (29 17 26 12): P = te found at S[17,18], S[26,27], S[12,13]
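The outcome of this search can be checked with a flat, in-memory stand-in for the tree: a binary search over the sorted suffixes. This sketch is not a String B-Tree (there are no pages and no Patricia Tries); it only illustrates that every occurrence of P is a suffix having P as a prefix, and that those suffixes are adjacent in sorted order. The "#" separator and the function names are assumptions.

```python
from bisect import bisect_left

def suffix_table(words, sep="#"):
    """Return the sorted list of (suffix, 1-based pointer) over all words."""
    text = sep.join(words)
    entries = []
    for i, ch in enumerate(text):
        if ch != sep:
            end = text.find(sep, i)
            entries.append((text[i:] if end == -1 else text[i:end], i + 1))
    entries.sort()
    return entries

def find_prefix(entries, pattern):
    """Return the pointers of all suffixes that start with `pattern`, in sorted-suffix order."""
    keys = [key for key, _ in entries]
    i = bisect_left(keys, pattern)
    hits = []
    while i < len(keys) and keys[i].startswith(pattern):
        hits.append(entries[i][1])
        i += 1
    return hits

words = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]
print(find_prefix(suffix_table(words), "te"))  # [17, 26, 12]
```

The three pointers are the same ones reported at the leaf node: S[17], S[26] and S[12].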


Multi-D Index -- Construction

Input: A set of pairs of strings (not necessarily of same length)

Dimension X   Dimension Y
abce          magh
abcd          makk
abqs          makk
abqs          mdbc
alaa          magz
almn          mazz
abqa          maza
abzz          mdyz

Step 1. Store the pairs of strings (separated properly) consecutively on disk

#abce$magh#abcd$makk#abqs$makk#abqs$mdbc#alaa$magz#almn$mazz#abqa$maza#abzz$mdyz#

Offsets marked along the stored string: 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75

Step 2. Create index leaf nodes, storing pointers to separating symbols

Step 3. Construct internal nodes (until the root is constructed).

R-trees and MBR (minimum bounding rectangle) computation are used for building up the index.

[Figure: leaf pointers 10 20 5 35 60 50 45 75, grouped into MBR1 and MBR2]

Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.

At each node, for each dimension, create an ‘Elided Trie’. E-tries are very similar to Patricia Tries.

For searches, use the E-Tries in a similar manner as the Patricia Tries (during the downward traversal of the index tree).

Multi-D Index -- Searching

Prefix Search: Q1 = (abc*, makk*)

Start at root E-Tries, repeat {

x-dim: abc* can only be in the left MBR

y-dim: makk* can be in both MBRs

Compute the intersection: examine only the left MBR

….. until a leaf index node is reached ….

}

Step k (leaf page) { // compute candidates

x-dim: string pair @ 0, string pair @ 10

y-dim: string pair @ 10, string pair @ 20

Answer to query = the intersection

}
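The leaf-level step of this prefix query can be sketched by computing the candidates per dimension and intersecting them. The sketch scans all pairs linearly, so it stands in for the R-tree and E-Trie traversal rather than implementing it; the function names are illustrative, and offsets refer to the "#" that precedes each pair in the stored string.

```python
STORED = ("#abce$magh#abcd$makk#abqs$makk#abqs$mdbc"
          "#alaa$magz#almn$mazz#abqa$maza#abzz$mdyz#")

def parse_pairs(stored):
    """Return {offset of the leading '#': (x, y)} for a string of the form '#x$y#x$y#'."""
    pairs = {}
    offset = 0
    for record in stored.strip("#").split("#"):
        x, y = record.split("$")
        pairs[offset] = (x, y)
        offset += len(record) + 1  # +1 for the '#' separator
    return pairs

def prefix_query(pairs, x_prefix, y_prefix):
    """Return (x candidates, y candidates, their intersection) as sets of offsets."""
    x_hits = {off for off, (x, _) in pairs.items() if x.startswith(x_prefix)}
    y_hits = {off for off, (_, y) in pairs.items() if y.startswith(y_prefix)}
    return x_hits, y_hits, x_hits & y_hits

pairs = parse_pairs(STORED)
x_hits, y_hits, answer = prefix_query(pairs, "abc", "makk")
print(sorted(x_hits))  # [0, 10]
print(sorted(y_hits))  # [10, 20]
print(sorted(answer), [pairs[o] for o in sorted(answer)])  # [10] [('abcd', 'makk')]
```

The candidate sets match the leaf-page step above: pairs @ 0 and @ 10 on the x dimension, pairs @ 10 and @ 20 on the y dimension, and the pair @ 10 as the answer.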



Suffix Tree [Gusfield ‘97]

A Suffix Tree for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.

Each internal node (except the root) has at least 2 children and each edge is labeled with a nonempty substring of S.

No 2 edges out of a node can have edge-labels beginning with the same character.

The key feature of the Suffix Tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, i.e., S [i..m].

Suffix Tree

Input: string S = xabxa; add $ at the end so that no suffix of S is a prefix of another suffix.

Suffix Tree for S = xabxa$ (edge labels, with leaf numbers in parentheses):

root
  xa
    bxa$ (1)
    $ (4)
  a
    bxa$ (2)
    $ (5)
  bxa$ (3)
  $ (6)

Suffix Tree -- Searching

S = x a b x a $ (positions 1 2 3 4 5 6)

Find all occurrences of P = xa in S

Match P along the edges from the root; the leaves below the point where the match ends give the occurrences. For P = xa these are leaves 1 and 4.
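A small sketch of construction and search for this example follows. It inserts the suffixes one by one and splits edges where needed, which takes quadratic time in the worst case; it is not one of the linear-time algorithms cited in this seminar, and the class and function names are illustrative. The input is assumed to end with a unique terminator such as $.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character of edge label -> (edge label, child node)
        self.leaf = leaf    # 1-based suffix start for leaves, None for internal nodes

def build_suffix_tree(s):
    """Naive construction: insert each suffix of s, splitting an edge at the first mismatch."""
    root = Node()
    for i in range(len(s)):
        node, suffix = root, s[i:]
        while True:
            first = suffix[0]
            if first not in node.children:
                node.children[first] = (suffix, Node(leaf=i + 1))
                break
            label, child = node.children[first]
            j = 0
            while j < len(label) and j < len(suffix) and label[j] == suffix[j]:
                j += 1
            if j < len(label):
                # Split the edge: the shared part stays above a new internal node.
                mid = Node()
                mid.children[label[j]] = (label[j:], child)
                node.children[first] = (label[:j], mid)
                child = mid
            node, suffix = child, suffix[j:]
    return root

def leaves(node):
    """Collect the leaf numbers in the subtree of `node`."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaves(child))
    return found

def find_all(root, pattern):
    """Return the sorted 1-based positions at which `pattern` occurs."""
    node = root
    while pattern:
        if pattern[0] not in node.children:
            return []
        label, child = node.children[pattern[0]]
        n = min(len(label), len(pattern))
        if label[:n] != pattern[:n]:
            return []
        pattern, node = pattern[n:], child
    return sorted(leaves(node))

tree = build_suffix_tree("xabxa$")
print(find_all(tree, "xa"))  # [1, 4]
```

The search cost depends on the length of P and the number of occurrences, not on the length of S, which is the main attraction of the structure.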

Generalized Suffix Tree

A ST can be built for more than one string.

S1 = x a b x a $ (positions 1 to 6)

S2 = b x a d $ (positions 1 to 5)

[Figure: generalized suffix tree for S1 and S2; each leaf is labeled "position,string", e.g. 3,1 for the suffix of S1 starting at position 3]


Desired Properties of the Indexing Technique

Relatively fast construction, reasonable amount of storage consumption (persistently stored);

Allows huge sequences to be indexed;

Supports versatile queries over data;

+

Supports bioinformatics applications!

Applicability for Biological Sequence Data

Data structure: suitable for bio-data indexing?

Q-grams: Yes. BLAST uses a very similar idea. Provides high similarity matching, suitable for some bioinformatics applications.

String B-Tree: Yes? A DNA sequence cannot be broken into words, but can we exploit the repeats?

Multi-D Index: Yes? Can we view promoters, genes, exons, introns, etc. as attributes in a DB?

Suffix Tree: Yes? Slow construction, limited input sequence size, size of index ≈ 10x size of input, but supports versatile queries over data.


Suffix Trees: A closer look

Suffix Trees are well known in the biological sequence processing field

Recent advances in Suffix Tree construction algorithms

Suffix Trees provide support for answering versatile biological questions

Suffix Tree (ST) Applications

REPuter [Kurtz ‘99]
The REPuter program family provides state of the art software solutions to compute and visualize repeats in whole genomes or chromosomes.

MUMmer [Delcher ‘99, ‘02, ‘04]
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. The NUCmer program aligns contigs from a shotgun sequencing project to another set of contigs or a genome.

ST Construction Algorithms History

[Weiner ‘73] First linear time algorithm to build Suffix Tree (called Position Tree).

[McCreight ‘76] A more space efficient solution.

[Ukkonen ‘95] Presents a variation of [McCreight ‘76], but much easier to understand, to prove bounds, and to implement.

All these algorithms are in-memory algorithms. In practice, the sequences to be indexed are too large to fit in memory, and the corresponding ST is ≈ 10x bigger.

Advances in ST Construction Algorithms

[Hunt ‘01] Abandons the use of suffix links (the algorithm is no longer linear); presents the idea of partitioning to reduce the number of disk I/Os.

[Giegerich ‘03] Proposes a space efficient representation of ST.

[Tata ‘04] Extends ideas in [Hunt ‘01] and [Giegerich ‘03]; focuses on development of an efficient buffering strategy.

[Tata ‘04] builds a ST on the entire human genome (approx. 3 Gbp) in 30 hours, using a single processor machine; even for the in-memory case, [Tata ‘04], which is O(m²), performs better than [Ukkonen ‘95], which is O(m).

Versatile Biological Support by ST

Exact search (with or without wild cards)

Approximate search

[Longest] Common substring/subsequence of 2 (or more) strings
- Recognizing DNA contamination
- Alignment

[Shortest] Superstring of 2 (or more) strings
- Shotgun sequencing and sequence assembly

Finding repeats in a single sequence

Compressing DNA strings to study the information content of a string or to discriminate between exons and introns in eukaryotic DNA

….

Suffix Tree Representations

Suffix Array [Manber ‘93, Myers ‘94, Baeza-Yates ‘00]

LC-tries [Anderson ‘95]

Suffix Binary Search Tree [Irving ‘03]


Conclusion

BLAST Case Study

Observations on existing searching techniques

Alternative indexing techniques for sequence data and their possible application for biological sequence data

Suffix Trees

Future Work

Suffix Tree Construction
- Further improvements of [Tata ‘04] algorithm (time/space)
- Combining two (or more) Suffix Trees
- Suffix Tree maintenance

Suffix Tree Usage
- Most of the widely known ST-based algorithms rely on suffix links. How will the algorithms that use ST change in the absence of suffix links?
- Potential of ST for mining biodata

Alternative Index Data Structures
“Families of reiterated sequences account for about one third of the human genome.” [McConkey ‘93]

References

[Altschul ‘90] S.F. Altschul et al. “Basic local alignment search tool”. J. Mol. Biol., 215:403-10, 1990.

[Altschul ‘97] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Research, 25:3389-3402, 1997.

[Anderson ‘95] A. Andersson and S. Nilsson. “Efficient implementation of suffix trees”. Softw. Pract. Exp., 25(2):129-141, 1995

[Baeza-Yates ‘00] R. Baeza-Yates and G. Navarro. “A Hybrid Indexing Method for Approximate String Matching”. Journal of Discrete Algorithms, 2000.

[Delcher ‘99] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. “Alignment of Whole Genomes”. Nucleic Acids Research, 27:2369-2376, 1999.

[Ferragina ‘99] P. Ferragina and R. Grossi. “The string B-tree: a new data structure for string search in external memory and its applications”. Journal of the ACM, 46(2):236-280, 1999

[Giegerich ‘03] R. Giegerich, S. Kurtz, and J. Stoye. “Efficient implementation of lazy suffix trees”. Softw. Pract. Exper., 33:1035-1049, 2003

[Gusfield ‘97] D. Gusfield. “Algorithms on strings, trees and sequences : computer science and computational biology”. Cambridge University Press, 1997

[Hunt ‘01] E. Hunt, M.P. Atkinson, and R.W. Irving. “A Database Index to Large Biological Sequences”. In VLDB J., 7(3):139-148, 2001

[Irving ‘03] R.W. Irving and L. Love. “The Suffix Binary Search Tree and Suffix AVL Tree”. Journal of Discrete Algorithms, 1 (2003) 387–408, 2003.

[Jagadish ‘00] H.V. Jagadish, Nick Koudas, and Divesh Srivastava. “On effective multi-dimensional indexing for strings”. In ACM SIGMOD Conference on Management of Data, pages 403-414, 2000.


[Kurtz ‘99] S. Kurtz and C. Schleiermacher. “REPuter: fast computation of maximal repeats in complete genomes”. Bioinformatics, pages 426-427, 1999

[Manber ‘93] U. Manber and G. Myers. “Suffix arrays: a new method for on-line string searches”. SIAM J. Comput., 22(5):935-948, 1993.

[McConkey ‘93] E. McConkey. “Human Genetics: The Molecular Revolution”. Jones and Bartlett, Boston, MA, 1993

[McCreight ‘76] E.M. McCreight. “A Space-economical Suffix Tree Construction Algorithm”. J. ACM, 23(2):262-272, 1976

[Myers ‘94] E. W. Myers. “A sublinear algorithm for approximate key word searching”. Algorithmica,12(4/5):345-374, 1994.

[Navarro ‘98] G. Navarro and R. Baeza-Yates. “A practical q-gram index for text retrieval allowing errors”. CLEI Electronic Journal, 1(2), 1998

[Navarro ‘00a] G. Navarro. “A Guided Tour to Approximate String Matching”. ACM Computing Surveys,33:1:31-88, 2000.

[Navarro ‘00b] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. “Indexing Text with Approximate q-grams”. In CPM2000, LNCS 1848, pages 350-365, 2000

[Tata ‘04] S. Tata, R.A. Hankins, and J. Patel. “Practical Suffix Tree Construction”. In Proc. of the 30th VLDB, 2004

[Thompson ‘94] J. D. Thompson, D. G. Higgins, and T. J. Gibson. “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice”. In Nucleic Acids Research, Vol. 22, No. 22 4673-4680, 1994

[Ukkonen ‘95] E. Ukkonen. “On-line construction of suffix-trees”. Algorithmica 14 (1995), 249-260, 1995

[Weiner ‘73] P. Weiner. “Linear Pattern Matching Algorithms”. In Proc. of the 14th Annual Symposium on Switching and Automata Theory, 1973

Thank You!