Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya


Master Course

MSc Bioinformatics for Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Dep. de Llenguatges i Sistemes Informàtics

CEPBA-IBM Research Institute

Universitat Politècnica de Catalunya


Contents and bibliography

1. (Exact) String matching of one pattern

2. (Exact) String matching of many patterns

3. Approximate string matching (Dynamic programming)

4. Pairwise and multiple alignment

5. Suffix trees

• Flexible Pattern Matching in Strings. G. Navarro and M. Raffinot, Cambridge University Press, 2002.

• Algorithms on Strings, Trees and Sequences. D. Gusfield, Cambridge University Press, 1997.


String matching

Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.

On-line algorithms: the patterns are known in advance and can be preprocessed.

• Only one pattern (exact and approximate)

• Five, ten, a hundred, a thousand, ... patterns (exact)

• Extended patterns

Off-line algorithms: the text is known in advance and can be preprocessed.

• Suffix trees


Master Course

First lecture:

First part:

(Exact) string matching of one pattern


String matching: one pattern

For instance, given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

search for the pattern ACTGA

and for the pattern TACTACGGTATGACTAA.

How do string matching algorithms perform the search?


String Matching: Brute force algorithm

Example: given the pattern ATGTA, the window advances one position at a time:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A


String Matching: Brute force algorithm

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open Brute Force algorithm

What is the meaning of the variables?

y: array with the text T

n: length of the text

x: array with the pattern P

m: length of the pattern
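The variables above map directly onto a short sketch of the brute force search. Python is used here only for illustration (the slides point to C code elsewhere), and the function name `brute_force` is an assumption, not a name taken from the referenced page.

```python
def brute_force(x: str, y: str) -> list[int]:
    """Return every position of the text y where the pattern x occurs."""
    m, n = len(x), len(y)
    occurrences = []
    for j in range(n - m + 1):              # window y[j : j + m]
        i = 0
        while i < m and x[i] == y[j + i]:   # compare from left to right
            i += 1
        if i == m:                          # the whole pattern matched
            occurrences.append(j)
    return occurrences                      # the window always moves one cell


print(brute_force("ATGTA", "GTACTAGAGGACGTATGTACTG"))  # prints [14]
```

Each of the n - m + 1 windows may need up to m comparisons, which is where the O(nm) bound comes from.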

C code of the running file

Connect to

http://www.lsi.upc.edu/~peypoch


String Matching of one pattern

The cost of the brute force algorithm is O(nm), where n is the length of the text and m is the length of the pattern.

Can the search be made with lower cost?

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

TACTACGGTATGACTAA

Factor search

Prefix search

Suffix search


String matching of one pattern

How do string matching algorithms perform the search?

A window slides along the text, and the pattern is compared against it. At each step the comparison is made and the window is shifted to the right.

[Figure: the pattern aligned under a window of the text.]

What differentiates the algorithms?

1. How the comparison is made.

2. The length of the shift.


String Matching: Brute force algorithm

• How is the comparison made? From left to right: prefix search.

• What is the next position of the window? The window is shifted only one cell.

The cost is O(mn).


String Matching: one pattern

Most efficient algorithms (Navarro & Raffinot)

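[Figure: map of the most efficient algorithm as a function of the length of the pattern (horizontal axis, 2 to 256) and the size of the alphabet |Σ| (vertical axis, 2 to 64). Regions: Horspool, BNDM and BOM. The mark w denotes the length of the computer word.]

BNDM: Backward Nondeterministic Dawg Matching

BOM: Backward Oracle Matching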


String Matching: Horspool algorithm

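• How is the comparison made? From right to left: suffix search.

• What is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern: the window is shifted so that the rightmost 'a' of the pattern (not counting its last position) lies under that letter.

[Figure: the pattern shifted so that one of its letters 'a' is aligned with the letter 'a' of the text.]

A preprocessing step is therefore needed to determine the length of the shift for each letter.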


String Matching: Horspool algorithm

Example: given the pattern ATGTA, the shift table is

A 4
C 5
G 2
T 1

and the search proceeds as follows:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
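The shift table and the search of this example can be reproduced with a minimal sketch of the Horspool algorithm. The function names and the `alphabet` parameter are assumptions made for illustration; the C code referenced by the slides may be organised differently.

```python
def horspool_shift_table(pattern: str, alphabet: str = "ACGT") -> dict[str, int]:
    """Shift for each letter: distance from its rightmost occurrence
    (last cell excluded) to the end of the pattern, or m if absent."""
    m = len(pattern)
    shift = {c: m for c in alphabet}
    for i in range(m - 1):                  # the last cell is not considered
        shift[pattern[i]] = m - 1 - i
    return shift


def horspool(pattern: str, text: str, alphabet: str = "ACGT") -> list[int]:
    """Return the start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    shift = horspool_shift_table(pattern, alphabet)
    occurrences = []
    pos = 0
    while pos <= n - m:
        i = m - 1
        while i >= 0 and pattern[i] == text[pos + i]:   # right to left
            i -= 1
        if i < 0:
            occurrences.append(pos)
        # the shift depends only on the last letter of the window
        pos += shift.get(text[pos + m - 1], m)
    return occurrences


print(horspool_shift_table("ATGTA"))        # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
print(horspool("ATGTA", "GTACTAGAGGACGTATGTACTG"))  # [14]
```

Letters outside the assumed alphabet fall back to a shift of m, which is safe because they cannot occur in the pattern.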


String Matching: Horspool algorithm

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the Horspool algorithm

C code

Connect to

http://www.lsi.upc.edu/~peypoch



BNDM algorithm

• How is the shift determined?

• How is the comparison made?

The window is read from right to left, searching for suffixes of the window that are factors of P.

[Figure: the text window above the pattern; x is the next character of the text to be read.]

This state is expressed with an array D of bits, one per cell of the pattern, for instance

D2 = ( 1 0 0 0 1 0 0 )

How can the next state be obtained? Given the mask B(x) of x, which marks the cells where the character x appears in the pattern,

D = (D << 1) & B(x)

For instance, if B(x) = ( 0 0 1 1 0 0 0 ) then

D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )

BNDM algorithm: example

Given the pattern ATGTA, the masks of the characters are:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

Given the text, the windows are:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A

First window:

D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Second window:

D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )

Third window:

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Fourth window:

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )

Pattern found!


BNDM algorithm

• How is the comparison made? By searching for suffixes of the window that are factors of P. This state is expressed with an array D of bits:

D = ( 1 0 0 0 1 0 0 )

• How is the shift determined? If the leftmost bit is set to one at step i, a prefix of P of length i is equal to a suffix of the window. The window is then shifted m-i cells, taking the largest such i smaller than m; if the leftmost bit is never set, the window is shifted m cells.
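The masks, the update D = (D << 1) & B(x) and the shift rule can be put together in a minimal sketch of BNDM. Python integers stand in for the machine word, so the leftmost cell of the slides is the highest bit; the function names are assumptions made for illustration.

```python
def char_masks(pattern: str) -> dict[str, int]:
    """B[c] has a 1 in every cell where c appears (leftmost cell = highest bit)."""
    m = len(pattern)
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


def bndm(pattern: str, text: str) -> list[int]:
    """Return the start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    B = char_masks(pattern)
    full = (1 << m) - 1          # m ones
    top = 1 << (m - 1)           # leftmost bit
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m           # last = length of the next shift
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window right to left
            j -= 1
            if D & top:          # a prefix of the pattern ends the window
                if j > 0:
                    last = j     # shift m - i, with i = m - j characters read
                else:
                    occurrences.append(pos)    # the whole window matched
            D = (D << 1) & full
        pos += last
    return occurrences


print(format(char_masks("ATGTA")["T"], "05b"))       # 01010
print(bndm("ATGTA", "GTACTAGAGGACGTATGTACTG"))       # [14]
```

In C the same idea is limited to patterns no longer than the machine word, since D and the masks must each fit in one word.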



BOM (Backward Oracle Matching)

• How is the shift determined?

• How is the comparison made? The window is read from right to left with an automaton, the Factor Oracle (1999), which checks whether the suffix read so far is a factor of the pattern.


Automaton Factor Oracle: properties

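Factor Oracle of the word GTATGTA

[Figure: factor oracle automaton of GTATGTA.]

The automaton recognizes all the factors of the word, for instance GTATG, ATG and TATG, but it also recognizes other strings such as GTG, which is not a factor. Hence it is useful only to discard words as factors: a word that the automaton rejects is certainly not a factor.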
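The behaviour described above can be checked with a small sketch that builds the factor oracle with the usual on-line construction (one state per letter plus a supply link per state). The names `factor_oracle` and `accepts` are assumptions made for illustration.

```python
def factor_oracle(word: str) -> list[dict[str, int]]:
    """Return the transitions of the factor oracle, one dict per state."""
    trans: list[dict[str, int]] = [{}]
    supply = [-1]                        # supply link of the initial state
    for i, c in enumerate(word, start=1):
        trans.append({})
        trans[i - 1][c] = i              # transition along the word itself
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i              # extra transition jumping to state i
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][c])
    return trans


def accepts(trans: list[dict[str, int]], s: str) -> bool:
    """True if s can be read from the initial state (all states are final)."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


oracle = factor_oracle("GTATGTA")
print(accepts(oracle, "TATG"))   # True: a factor
print(accepts(oracle, "GTG"))    # True, although GTG is not a factor
print(accepts(oracle, "AA"))     # False: certainly not a factor
```

The construction adds at most one extra transition per loop iteration, so the automaton stays small compared with an exact factor automaton.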

BOM: example

• The Factor Oracle of the inverted pattern is built. Suppose the pattern is ATGTATG.

[Figure: factor oracle of the inverted pattern GTATGTA.]

• How is the comparison made?

• Search:

G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
…

BOM (Backward Oracle Matching)

• How is the comparison made? The window is read from right to left with the Factor Oracle automaton, which checks whether the suffix read so far is a factor of the pattern.

• How is the shift determined? Let a be the first mismatch, that is, the first character that the automaton cannot read: the window is shifted just past a.
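The comparison and the shift rule can be sketched as follows. The oracle is built for the inverted pattern with the usual on-line construction, and the window is shifted by one cell after an occurrence; function names are assumptions made for illustration, not the code referenced in the slides.

```python
def factor_oracle(word: str) -> list[dict[str, int]]:
    """Transitions of the factor oracle of word, one dict per state."""
    trans: list[dict[str, int]] = [{}]
    supply = [-1]
    for i, c in enumerate(word, start=1):
        trans.append({})
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][c])
    return trans


def bom(pattern: str, text: str) -> list[int]:
    """Return the start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    trans = factor_oracle(pattern[::-1])     # oracle of the inverted pattern
    occurrences = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:   # read the window right to left
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            # m characters were read: the only word of length m that the
            # oracle accepts is the inverted pattern itself
            occurrences.append(pos)
        # j is now the cell of the first mismatch (or 0 after a full read)
        pos += j + 1
    return occurrences


print(bom("ATGTATG", "GTACTAGAATGTGTAGACATGTATGGTGA"))   # [18]
```

On this text the windows start at cells 0, 6, 8, 11 and 18, the last one being the occurrence.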


String Matching: BNDM and BOM

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the BNDM and BOM algorithms

C code of BNDM C code of BOM


Master Course

First lecture:

Second part:

(Exact) string matching of many patterns


String matching: many patterns

Given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

Search for the patterns

ACTGACTGTCTAATT

ACTGATCTTTGTAGCAATACTACATGCACTGA.

Trie

Trie of the words GTATGTA, GTAT, TAATA, GTGTA

[Figure: trie of the four words.]

What is the cost?
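One way to approach the question is to build the trie and count its nodes. The sketch below uses nested dictionaries and a "$" end-of-word marker; both are assumptions made for illustration (the marker must not belong to the alphabet). With this representation each letter of each word is handled once, so construction takes time proportional to the total length of the words.

```python
def build_trie(words: list[str]) -> dict:
    """Build a trie as nested dicts; "$" marks the end of a word."""
    root: dict = {}
    for w in words:
        node = root
        for c in w:                       # one step per letter of the word
            node = node.setdefault(c, {})
        node["$"] = w
    return root


def contains(trie: dict, word: str) -> bool:
    """True if word was inserted in the trie."""
    node = trie
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return node.get("$") == word


def count_nodes(node: dict) -> int:
    """Number of nodes below this one (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for key, child in node.items() if key != "$")


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(contains(trie, "GTAT"), contains(trie, "GTA"))   # True False
print(count_nodes(trie))                               # 15
```

The four words have 21 letters in total, but shared prefixes reduce the trie to 15 nodes plus the root.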

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns.

[Figure: trie of the inverted patterns GTATGTA, GTAT, TAATA, GTGTA.]

2. Compute lmin = 4.

3. Build the table of shifts:

A 1
C 4 (lmin)
G 2
T 1

4. Start the search in the text ACATGCTATGTGACA…

Short shifts!
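The four steps above can be sketched as follows. The trie is a nested dictionary with a "$" end marker, and the function and parameter names are assumptions made for illustration; the text is assumed not to contain the marker.

```python
def set_horspool(patterns: list[str], text: str, alphabet: str = "ACGT"):
    """Return (start, pattern) for every occurrence of a non-empty pattern."""
    lmin = min(len(p) for p in patterns)
    # 1. trie of the inverted patterns
    trie: dict = {}
    for p in patterns:
        node = trie
        for c in reversed(p):
            node = node.setdefault(c, {})
        node["$"] = p
    # 2-3. table of shifts: smallest distance from a letter to the end of
    # any pattern (last cell excluded), never larger than lmin
    shift = {c: lmin for c in alphabet}
    for p in patterns:
        m = len(p)
        for j in range(m - 1):
            shift[p[j]] = min(shift.get(p[j], lmin), m - 1 - j)
    # 4. search: pos is the last cell of the window
    occurrences = []
    pos = lmin - 1
    while pos < len(text):
        node, j = trie, pos
        while j >= 0 and text[j] in node:     # read backwards in the trie
            node = node[text[j]]
            j -= 1
            if "$" in node:
                occurrences.append((j + 1, node["$"]))
        pos += shift.get(text[pos], lmin)
    return occurrences, shift


found, table = set_horspool(["ATGTATG", "TATG", "ATAAT", "ATGTG"], "ACATGCTATGTGACA")
print(table)   # {'A': 1, 'C': 4, 'G': 2, 'T': 1}
print(found)   # [(6, 'TATG'), (7, 'ATGTG')]
```

Because the table takes the minimum over all patterns and is bounded by lmin, most shifts in this example are of one or two cells.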

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol:

A 1
C 4 (lmin)
G 2
T 1

2 symbols:

AA 1
AC 3 (lmin-l+1)
AG 3
AT 1
CA 3
CC 3
CG 3
…

Only the l-mers that appear in the patterns need to be stored; every other l-mer takes the default shift lmin-l+1:

AA 1
AT 1
GT 1
TA 2
TG 2

Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG

[Figure: trie of the inverted patterns.]

in the text: ACATGCTATGTGACATAATA

AA 1
AT 1
GT 1
TA 2
TG 2

Experimental length of the l-mers: l = log_|Σ| (2 · lmin · r), where r is the number of patterns.
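The table of l-mers above can be reproduced with the sketch below. It is only an illustration of the shift table as it appears in this example: candidates are verified naively against every pattern, whereas the published Wu-Manber algorithm selects candidates through hash tables. Function names are assumptions, and lmin is assumed to be at least l.

```python
def block_shift_table(patterns: list[str], l: int = 2):
    """Shift for each l-mer: smallest distance from an occurrence of the
    l-mer (final l-mer excluded) to the end of a pattern, capped at lmin-l+1."""
    lmin = min(len(p) for p in patterns)
    default = lmin - l + 1
    shift: dict[str, int] = {}
    for p in patterns:
        m = len(p)
        for end in range(l, m):              # l-mer p[end-l:end], end < m
            block = p[end - l:end]
            shift[block] = min(shift.get(block, default), m - end)
    return shift, default


def block_search(patterns: list[str], text: str, l: int = 2):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    shift, default = block_shift_table(patterns, l)
    lmin = min(len(p) for p in patterns)
    occurrences = []
    pos = lmin                               # exclusive end of the window
    while pos <= len(text):
        for p in patterns:                   # naive verification (assumption)
            if pos >= len(p) and text[pos - len(p):pos] == p:
                occurrences.append((pos - len(p), p))
        pos += shift.get(text[pos - l:pos], default)
    return occurrences


pats = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_shift_table(pats)[0])
# {'AT': 1, 'TG': 2, 'GT': 1, 'TA': 2, 'AA': 1}
print(block_search(pats, "ACATGCTATGTGACATAATA"))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```

With 2-mers the default shift is 3 instead of the frequent shifts of 1 seen with single letters, which is the point of moving from letters to l-mers.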

String matching of many patterns

[Figure: maps of the most efficient algorithm as a function of the minimum length of the patterns Lmin (horizontal axis, 5 to 45) and the size of the alphabet |Σ| (vertical axis, 2 to 8), for sets of 5, 10, 100 and 1000 patterns. Regions: Wu-Manber and SBOM.]


SBOM

• How is the shift determined?

• How is the comparison made? The window is read from right to left with a Factor Oracle automaton, which checks whether the suffix read so far is a factor of any pattern.

Factor Oracle of many patterns

The AFO (Factor Oracle automaton) of GTATGTA, GTAA, TAATA and GTGTA

[Figure: factor oracle of the four words.]

SBOM algorithm

[Figure: window of the text above the patterns; a is the character being read.]

Automaton ………… of length lmin

• How is the comparison made?

• How is the shift determined?

  - If a does not appear in the AFO
  - If lmin characters have been read

SBOM algorithm: example

Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG

[Figure: factor oracle (AFO) built for the four patterns.]

in the text ACATGCTAGCTATAATAATGTATG

Algorithms for the exact search of many patterns

[Figure: maps of the most efficient algorithm as a function of the minimum length of the patterns (horizontal axis, 5 to 45) and the size of the alphabet |Σ| (vertical axis, 2 to 8), for sets of 5, 10, 100 and 1000 words. Regions: Wu-Manber, SBOM and Ad AC.]