Master Course

MSc Bioinformatics for Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Dep. de Llenguatges i Sistemes Informàtics

CEPBA-IBM Research Institute

Universitat Politècnica de Catalunya

Contents

1. (Exact) String matching of one pattern

2. (Exact) String matching of many patterns

3. Approximate string matching (Dynamic programming)

4. Pairwise and multiple alignment

5. Suffix trees

Bibliography

• Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002

• Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997

String matching

Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.

On-line algorithms: the patterns are known in advance.

Off-line algorithms: the text is known in advance.

• Only one pattern (exact and approximate)

• Five, ten, a hundred, a thousand, … patterns (exact)

• Extended patterns

• Suffix trees

First lecture:

First part:

(Exact) string matching of one pattern

String matching: one pattern

For instance, given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

search for the pattern ACTGA

and for the pattern TACTACGGTATGACTAA.

How do string matching algorithms perform the search?

String Matching: Brute force algorithm

Example: given the pattern ATGTA, the search is

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A

String Matching: Brute force algorithm

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the Brute Force algorithm.

What is the meaning of the variables?

y: array with the text T

n: length of the text

x: array with the pattern P

m: length of the pattern

C code of the running file

Connect to

http://www.lsi.upc.edu/~peypoch

String Matching of one pattern

The cost of the Brute Force algorithm is O(nm).

Can the search be made at a lower cost?

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

TACTACGGTATGACTAA

Factor search

Prefix search

Suffix search

String matching of one pattern

How do string matching algorithms perform the search?

There is a sliding window along the text against which the pattern is compared. At each step the comparison is made and the window is shifted to the right.

[Figure: the pattern aligned with a window of the text]

Which facts differentiate the algorithms?

1. How the comparison is made.

2. The length of the shift.

String Matching: Brute force algorithm

• How is the comparison made?

From left to right: prefix search.

• Which is the next position of the window?

The window is shifted only one cell.

The cost is O(mn).
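The brute force search described above can be sketched as follows. This is only an illustrative Python version, not the C code the slides link to; the function name `brute_force` is an assumption, while the variable names y, n, x, m follow the slides.

```python
def brute_force(y, x):
    """Return the start positions of every occurrence of pattern x in text y."""
    n, m = len(y), len(x)
    positions = []
    # The window is shifted only one cell at each step.
    for j in range(n - m + 1):
        i = 0
        # Prefix search: compare from left to right until a mismatch.
        while i < m and x[i] == y[j + i]:
            i += 1
        if i == m:
            positions.append(j)
    return positions


if __name__ == "__main__":
    print(brute_force("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

Both loops are bounded by n and m respectively, which is where the O(mn) cost comes from.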

String Matching: one pattern

Most efficient algorithms (Navarro & Raffinot)

[Figure: regions where Horspool, BNDM and BOM are the most efficient, as a function of the length of the pattern (horizontal axis, 2 to 256, with a mark at w) and of |Σ| (vertical axis, 2 to 64)]

BNDM : Backward Nondeterministic Dawg Matching

BOM : Backward Oracle Matching

String Matching: Horspool algorithm

• How is the comparison made?

From right to left: suffix search.

• Which is the next position of the window?

It depends on where the last letter of the text window, say 'a', appears in the pattern: the window is shifted so that the rightmost 'a' of the pattern, not counting its last cell, is aligned with that letter.

[Figure: the possible alignments of 'a' between pattern and text]

A preprocessing step is therefore necessary to determine the length of the shift for each letter.

String Matching: Horspool algorithm

Example: given the pattern ATGTA, the shift table is

A 4
C 5
G 2
T 1

And the search:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
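A minimal Python sketch of the Horspool search, under the assumption that the shift of a letter is its distance from the rightmost occurrence in the pattern (last cell excluded) to the end of the pattern, and m for letters that do not appear. The names `horspool_shifts` and `horspool` are illustrative, and the list of visited windows is returned only so that the example above can be checked.

```python
def horspool_shifts(pattern):
    """Shift table: letters absent from the table shift by len(pattern)."""
    m = len(pattern)
    shift = {}
    for j in range(m - 1):              # the last cell is excluded
        shift[pattern[j]] = m - 1 - j   # later occurrences overwrite earlier ones
    return shift


def horspool(text, pattern):
    """Return (occurrences, visited window positions)."""
    n, m = len(text), len(pattern)
    shift = horspool_shifts(pattern)
    positions, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # Suffix search: compare from right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            positions.append(pos)
        # The shift depends only on the last letter of the window.
        pos += shift.get(text[pos + m - 1], m)
    return positions, windows


if __name__ == "__main__":
    print(horspool_shifts("ATGTA"))  # {'A': 4, 'T': 1, 'G': 2}
    print(horspool("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

On the text of the example the visited windows start at cells 0, 1, 5, 7, 12 and 14 (counting from 0), and the pattern is found at cell 14.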

String Matching: Horspool algorithm

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the Horspool algorithm

C code

Connect to

http://www.lsi.upc.edu/~peypoch

BNDM algorithm

• How is the shift determined?

• How is the comparison made?

The window is read from right to left, searching for suffixes of the window that are factors of P.

This state is expressed with an array D of bits, for instance:

D2 = ( 1 0 0 0 1 0 0 )

How can the next state be obtained? Given the mask B(x) of the next character x, which marks the cells where x appears in the pattern:

D = (D << 1) & B(x)

If B(x) = ( 0 0 1 1 0 0 0 ) then

D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )

BNDM algorithm: example

Given the pattern ATGTA, the masks of the characters are:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

BNDM algorithm: example

Given the pattern ATGTA and the text:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A

First window, G T A C T:

D1 = B(T) = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

The leftmost bit is never set: the window is shifted 5 cells.

Second window, A G A G G:

D1 = B(G) = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )

The leftmost bit is never set: the window is shifted 5 cells.

Third window, A C G T A:

D1 = B(A) = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

The leftmost bit is set only in step 1: the window is shifted 5-1 = 4 cells.

Fourth window, A T G T A:

D1 = B(A) = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )

The leftmost bit is set in step 5 = m. Pattern found!

BNDM algorithm

• How is the shift determined?

If the leftmost bit of D is set to one in step i, a prefix of P of length i is equal to a suffix of the window. If i = m, the pattern has been found. The window is then shifted m-i cells, where i < m is the last step in which the leftmost bit was set; if there is no such step, it is shifted m cells.
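The bit-parallel steps above can be sketched in Python as follows. This is a hedged illustration, not the C code referenced by the slides: the names `bndm_masks` and `bndm` are assumptions, Python integers stand in for the machine word, and the leftmost bit of the slides is taken to be the most significant of m bits.

```python
def bndm_masks(pattern):
    """B(x): bit for cell i of the pattern, the leftmost bit being cell 1."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


def bndm(text, pattern):
    """Return (occurrences, visited window positions)."""
    n, m = len(text), len(pattern)
    masks = bndm_masks(pattern)
    full = (1 << m) - 1          # m ones
    top = 1 << (m - 1)           # the leftmost bit
    positions, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last = m, m
        d = full
        while d != 0:
            # Read the window from right to left.
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:
                if j > 0:
                    last = j     # a prefix of P of length m-j ends the window
                else:
                    positions.append(pos)
            d = (d << 1) & full  # D = D << 1, kept on m bits
        pos += last
    return positions, windows


if __name__ == "__main__":
    print(bndm("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # ([14], [0, 5, 10, 14])
```

The inner loop always ends after at most m steps, because each shift clears one more bit on the right, which matches D6 = ( 0 0 0 0 0 ) in the example.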

BOM (Backward Oracle Matching)

• How is the shift determined?

• How is the comparison made?

Automaton: Factor Oracle (1999). The window is read from right to left, and the automaton checks whether the suffix read so far is a factor of the pattern.

Automaton Factor Oracle: properties

Factor Oracle of the word G T A T G T A

[Figure: the automaton of the word G T A T G T A]

It recognizes all the factors of the word, such as A T G, T A T G and G T A T G,

but the automaton also recognizes other strings, such as G T G.

It is therefore useful only for discarding words as factors!
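The property above can be checked with a small sketch. The construction below is the usual on-line Factor Oracle construction with supply links; the slides do not give it, so treat it, and the names `build_factor_oracle` and `accepts`, as assumptions made for illustration.

```python
def build_factor_oracle(word):
    """Return the transitions of the Factor Oracle as a list of dicts.

    State i is reached after reading the first i letters of the word.
    """
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i, c in enumerate(word):
        trans[i][c] = i + 1                  # transition along the word
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1              # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans, s):
    """True if s can be read from the initial state."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


if __name__ == "__main__":
    oracle = build_factor_oracle("GTATGTA")
    print(accepts(oracle, "TATG"))  # True, a factor
    print(accepts(oracle, "GTG"))   # True, although GTG is not a factor
    print(accepts(oracle, "AA"))    # False, so AA is certainly not a factor
```

A rejected word is never a factor, whereas an accepted word may or may not be one.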

BOM: example

• The Factor Oracle of the inverted pattern is built. Suppose the pattern is ATGTATG; the inverted pattern is GTATGTA.

[Figure: Factor Oracle of G T A T G T A]

• Search:

G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G …

• How is the comparison made?

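BOM (Backward Oracle Matching)

• How is the comparison made?

With the automaton, the Factor Oracle, which checks whether the suffix read so far is a factor of the pattern.

• How is the shift determined?

If a is the first mismatch, that is, the first character for which the automaton has no transition, the window is shifted to start just after a.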
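A minimal Python sketch of the BOM search, assuming the standard on-line construction of the Factor Oracle (not given in the slides) and a shift of one cell after an occurrence. The names `build_factor_oracle` and `bom` are illustrative.

```python
def build_factor_oracle(word):
    """Transitions of the Factor Oracle of word, one dict per state."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i, c in enumerate(word):
        trans[i][c] = i + 1
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans


def bom(text, pattern):
    """Return (occurrences, visited window positions)."""
    n, m = len(text), len(pattern)
    trans = build_factor_oracle(pattern[::-1])   # oracle of the inverted pattern
    positions, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left until a transition is missing.
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            positions.append(pos)    # m letters read: the window is the pattern
        # After a mismatch the window restarts just after the failing letter;
        # after an occurrence (j == 0) it moves one cell.
        pos += j + 1
    return positions, windows


if __name__ == "__main__":
    print(bom("GTACTAGAATGTGTAGACATGTATGGTG", "ATGTATG"))
```

On the text of the example this visits the windows starting at cells 0, 6, 8, 11, 18 and 19 (counting from 0) and reports the occurrence at cell 18.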

String Matching: BNDM and BOM

Connect to

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the BNDM and BOM algorithms

C code of BNDM C code of BOM

First lecture:

Second part:

(Exact) string matching of many patterns

String matching: many patterns

Given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

Search for the patterns

ACTGACTGTCTAATT

ACTGATCTTTGTAGCAATACTACATGCACTGA.

Trie

Trie of the words GTATGTA, GTAT, TAATA, GTGTA

[Figure: trie of the four words]

Which is the cost?
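As an illustration, a trie can be sketched in Python with nested dictionaries. The representation is an assumption made for this sketch (the key `None` marks the end of a word), and the names `build_trie` and `trie_contains` are hypothetical. Building it visits each letter of each word once, so the construction is linear in the total length of the words.

```python
def build_trie(words):
    """Trie as nested dicts; the key None marks a node where a word ends."""
    root = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})   # follow or create the child
        node[None] = word
    return root


def trie_contains(root, word):
    """True if word is one of the words stored in the trie."""
    node = root
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return None in node


if __name__ == "__main__":
    trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
    print(sorted(trie))                  # ['G', 'T']
    print(trie_contains(trie, "GTAT"))   # True
    print(trie_contains(trie, "GTA"))    # False: only a prefix of stored words
```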

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns

[Figure: trie of GTATGTA, GTAT, TAATA, GTGTA]

2. lmin = 4

3. Table of shifts

A 1
C 4 (lmin)
G 2
T 1

4. Start the search

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACA…

Table of shifts:

A 1
C 4 (lmin)
G 2
T 1

[Figure: trie of the inverted patterns and the successive positions of the window over the text]

Short shifts!
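The four steps (trie of the inverted patterns, lmin, table of shifts, search) can be sketched in Python as below. This is an illustrative version under stated assumptions: the trie is a nested dictionary whose key `None` marks the end of a pattern, and the names `set_horspool_shifts` and `set_horspool` are hypothetical.

```python
def set_horspool_shifts(patterns):
    """Return (shift table, lmin); letters absent from the table shift by lmin."""
    lmin = min(len(p) for p in patterns)
    shift = {}
    for p in patterns:
        for j in range(len(p) - 1):          # the last cell is excluded
            shift[p[j]] = min(shift.get(p[j], lmin), len(p) - 1 - j)
    return shift, lmin


def set_horspool(text, patterns):
    """Return the list of (start position, pattern) occurrences."""
    shift, lmin = set_horspool_shifts(patterns)
    trie = {}
    for p in patterns:                        # trie of the inverted patterns
        node = trie
        for c in reversed(p):
            node = node.setdefault(c, {})
        node[None] = p
    found = []
    end = lmin - 1                            # last cell of the window
    while end < len(text):
        node, j = trie, end
        # Read the text backwards through the trie.
        while j >= 0 and text[j] in node:
            node = node[text[j]]
            j -= 1
            if None in node:
                found.append((j + 1, node[None]))
        end += shift.get(text[end], lmin)
    return found


if __name__ == "__main__":
    patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(set_horspool_shifts(patterns))
    print(set_horspool("ACATGCTATGTGACA", patterns))  # [(6, 'TATG'), (7, 'ATGTG')]
```

On this text the window advances by 1, 2, 1, 1, 1, 2, 2 and then 4 cells, which illustrates how short the shifts become with several patterns.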

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol:

A 1
C 4 (lmin)
G 2
T 1

2 symbols:

AA 1
AC 3 (lmin-l+1)
AG 3
AT 1
CA 3
CC 3
CG 3
…

The only 2-mers whose shift differs from lmin-l+1 = 3 are:

AA 1
AT 1
GT 1
TA 2
TG 2

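Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG in the text: ACATGCTATGTGACATAATA

[Figure: trie of the inverted patterns]

Shift table of 2-mers (every other 2-mer shifts 3 cells):

AA 1
AT 1
GT 1
TA 2
TG 2

Experimental length of the l-mers: log_|Σ| (2 · lmin · r), where r is the number of patterns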
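The shift tables of the slides can be reproduced with a short sketch. It assumes the convention that the tables above appear to follow: the shift of an l-mer is its smallest distance to the end of a pattern, the l-mer that ends the pattern being excluded, and every other l-mer shifts lmin-l+1 cells. The function name `lmer_shift_table` is hypothetical.

```python
def lmer_shift_table(patterns, l):
    """Return (table, default): l-mers absent from the table shift by default."""
    lmin = min(len(p) for p in patterns)
    default = lmin - l + 1
    table = {}
    for p in patterns:
        # end is the cell just after the l-mer; end == len(p) is excluded.
        for end in range(l, len(p)):
            lmer = p[end - l:end]
            dist = len(p) - end
            if dist < default:
                table[lmer] = min(table.get(lmer, default), dist)
    return table, default


if __name__ == "__main__":
    patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(lmer_shift_table(patterns, 1))
    print(lmer_shift_table(patterns, 2))
```

With l = 1 this gives A 1, G 2, T 1 and a default of 4; with l = 2 it gives AA 1, AT 1, GT 1, TA 2, TG 2 and a default of 3, as in the tables above.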

String matching of many patterns

[Figure: regions where Wu-Manber and SBOM are the most efficient, as a function of Lmin (horizontal axis, 5 to 45) and of |Σ| (vertical axis, 2 to 8); one panel each for 5, 10, 100 and 1000 patterns]

SBOM

• How is the shift determined?

• How is the comparison made?

Automaton: Factor Oracle. It checks whether the suffix read so far is a factor of any pattern.

Factor Oracle of many patterns

The AFO (Automaton Factor Oracle) of GTATGTA, GTAA, TAATA and GTGTA

[Figure: the automaton of the four words]

SBOM algorithm

[Figure: the text, the patterns and the automaton … of length lmin]

• How is the comparison made?

• How is the shift determined?

The reading of the window ends in one of two cases:

• If the character a does not appear in the AFO

• If lmin characters have been read

SBOM algorithm: example

Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text

ACATGCTAGCTATAATAATGTATG

[Figure: the automaton (AFO) of the patterns and the successive positions of the window over the text]

Exact string matching of many patterns

[Figure: regions where Wu-Manber, SBOM and Ad AC are the most efficient, as a function of the minimum length (horizontal axis, 5 to 45) and of |Σ| (vertical axis, 2 to 8); one panel each for 5, 10, 100 and 1000 words]
