
Text Algorithms
MTAT.03.190 Text Algorithms
Jaak Vilo, 2016 fall
23.11.16

Topics

• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes
  – Applications
• Suffix arrays
• Inverted index
• Approximate matching

Algorithms

One-pattern
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches
• Regular expressions (?)
• Weight matrices (?)

Multi-pattern
• Aho-Corasick
• Commentz-Walter

Indexing
• Trie (and suffix trie)
• Suffix tree

Exact pattern matching

• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ: alphabet, |Σ| = c
• Does S contain P?
  – Does S = S' P S" for some strings S' and S"?
  – Usually m << n, and n can be (very) large

Find occurrences in text

[Figure: pattern P aligned under text S]

Animations

• http://www-igm.univ-mlv.fr/~lecroq/string/
• EXACT STRING MATCHING ALGORITHMS, Animation in Java
• Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net


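Brute force: BAB in text?

Text:    A B A C A B A B B A B B B A
Pattern: B A B

Brute force

[Figure: P aligned under S at position i; pattern position j is compared with text position i+j-1]

Identify the first mismatch!

Question:
• Problems of this method?
• Ideas to improve the search?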

Brute force

Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for (i = 1; i <= n-m+1; i++) {
    for (j = 1; j <= m; j++)
        if (S[i+j-1] != P[j]) break;
    if (j > m) print i;
}

attempt 1:
gcatcgcagagagtatacagtacg
GCAg....

attempt 2:
gcatcgcagagagtatacagtacg
 g.......

attempt 3:
gcatcgcagagagtatacagtacg
  g.......

attempt 4:
gcatcgcagagagtatacagtacg
   g.......

attempt 5:
gcatcgcagagagtatacagtacg
    g.......

attempt 6:
gcatcgcagagagtatacagtacg
     GCAGAGAG

attempt 7:
gcatcGCAGAGAGtatacagtacg
      g.......

Brute force or Naive Search

function NaiveSearch(string s[1..n], string sub[1..m])
    for i from 1 to n-m+1
        for j from 1 to m
            if s[i+j-1] ≠ sub[j]
                jump to next iteration of outer loop
        return i
    return not found

C code

#include <string.h>

/* Count the occurrences of pat in text; n is the text length. */
int bf_2(char *pat, char *text, int n)
{
    int m, i, j;
    int count = 0;

    m = strlen(pat);
    for (i = 0; i + m <= n; i++) {
        /* advance j while pattern and text agree */
        for (j = 0; j < m && pat[j] == text[i+j]; j++)
            ;
        if (j == m)
            count++;
    }
    return count;
}

C code

/* Same count, calling strncmp at every text position. */
int bf_1(char *pat, char *text)
{
    int m;
    int count = 0;
    char *tp;

    m = strlen(pat);
    tp = text;
    for (; *tp; tp++) {
        if (strncmp(pat, tp, m) == 0)
            count++;
    }
    return count;
}


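Main problem of Naive

• For the next possible location of P, the same positions of S are checked again

[Figure: P aligned under S at position i, mismatch at pattern position j (text position i+j-1); the next alignment re-reads the same part of S]

Goals

• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test of S[i] <> P[j], what do we do now?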

Knuth-Morris-Pratt

• Make sure that no comparisons are “wasted”
• After such a mismatch we already know exactly the values of the green area in S!

[Figure: pattern aligned under the text, matched area marked, mismatch x ≠ y]

D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.

• P: longest suffix of any prefix that is also a prefix of the pattern
• Example: ABCABD

[Figure: prefix x, prefix y, and positions p, z on the pattern ABCABD]

Automaton for ABCABD

State:     1  2  3  4  5  6  7
Pattern:   A  B  C  A  B  D
Fail:      0  1  1  1  2  3  1

State k moves to state k+1 on pattern character k; state 1 loops on any character that is NOT A.


KMP matching

Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)

i = 1; j = 1;
initfail(P)                          // prepare fail links
repeat
    if j == 0 or S[i] == P[j]
        then i++, j++                // advance in text and in pattern
        else j = fail[j]             // use fail link
until j > m or i > n
if j > m then report match at i-m

Initialization of fail links

Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P

i = 1, j = 0, fail[1] = 0
repeat
    if j == 0 or P[i] == P[j]
        then i++, j++, fail[i] = j
        else j = fail[j]
until i >= m

[Figure: trace on ABCABD with pointers i and j; fail[] fills in as 0, then 0 1, then 0 1 1 1, then 0 1 1 1 2]

Time complexity of KMP matching?

Analysis of time complexity

• At every cycle either i and j both increase by 1
• Or j decreases (j = fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
  – A: not more than the number of increases of i
• Amortised analysis: O(n), preprocessing O(m)
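The two KMP routines can be transcribed into C almost line by line. The following is a minimal sketch, not a tuned implementation: it keeps the 1-based indices of the pseudocode by subtracting 1 on every array access, and the names kmp_init_fail and kmp_search are chosen here for illustration. One assumption beyond the slides: the fail loop runs one step further (while i <= m) so that fail[m+1] is also filled, which gives the seventh value of the fail row shown for ABCABD.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill fail[1..m+1] for pat (1-based, as in the pseudocode).
   fail must have room for at least m + 2 ints. */
void kmp_init_fail(const char *pat, int m, int *fail)
{
    int i = 1, j = 0;
    fail[1] = 0;
    while (i <= m) {                 /* assumption: one step past "until i >= m" */
        if (j == 0 || pat[i - 1] == pat[j - 1]) {
            i++; j++;
            fail[i] = j;
        } else {
            j = fail[j];             /* follow the fail link */
        }
    }
}

/* Return the 1-based position of the first occurrence of pat in text,
   0 if there is none, or -1 if memory could not be allocated. */
int kmp_search(const char *text, const char *pat)
{
    assert(text != NULL && pat != NULL);
    int n = (int)strlen(text), m = (int)strlen(pat);
    if (m == 0) return 1;            /* empty pattern matches at the start */

    int *fail = malloc((size_t)(m + 2) * sizeof *fail);
    if (fail == NULL) return -1;
    kmp_init_fail(pat, m, fail);

    int i = 1, j = 1;
    while (j <= m && i <= n) {
        if (j == 0 || text[i - 1] == pat[j - 1]) {
            i++; j++;                /* advance in text and in pattern */
        } else {
            j = fail[j];             /* use fail link; i never moves back */
        }
    }
    free(fail);
    return (j > m) ? i - m : 0;
}
```

Because i only ever increases and j can decrease at most as often as it has increased, the loop body runs fewer than 2n times, which is the amortised O(n) bound from the analysis.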

Karp-Rabin

• Compare in O(1) a hash of P and a hash of S[i..i+m-1]
• Goal: O(n)
• Updating h(T[i..i+m-1]) to h(T[i+1..i+m]) must take O(1)

[Figure: h(P) for P[1..m] compared with h(T[i..i+m-1]) for the text window i..(i+m-1)]

R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.


Hash

• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match, check all m positions
• Hash collisions => worst case O(nm)

Let's use numbers

• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• $H(T[3]) = (H(T[2]) - \mathrm{ord}(T[2]) \cdot 10^{m-1}) \cdot 10 + T[3+m-1] = (712 - 7 \cdot 100) \cdot 10 + 5 = 125$

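Hash

• c: size of the alphabet
• $HS_i = H(S[i..i+m-1])$
• $H(S[i+1..i+m]) = (HS_i - \mathrm{ord}(S[i]) \cdot c^{m-1}) \cdot c + \mathrm{ord}(S[i+m])$
• Modulo arithmetic, to fit the value in a word!
• $\mathrm{hash}(w[0..m-1]) = (w[0] \cdot 2^{m-1} + w[1] \cdot 2^{m-2} + \cdots + w[m-1] \cdot 2^{0}) \bmod q$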

Karp-Rabin

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

c = 20                       /* size of the alphabet, say nr. of amino acids */
q = 33554393                 /* q is a prime */
cm = c^(m-1) mod q
hp = 0; hs = 0
for i = 1 .. m do hp = (hp*c + ord(P[i])) mod q      // H(P)
for i = 1 .. m do hs = (hs*c + ord(S[i])) mod q      // H(S[1..m])
if hp == hs and P == S[1..m] report match at position 1
for i = 2 .. n-m+1
    // add q before subtracting if the language's mod can return a negative value
    hs = ((hs - ord(S[i-1])*cm) * c + ord(S[i+m-1])) mod q
    if hp == hs and P == S[i..i+m-1]
        report match at position i
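The pseudocode above can be sketched in C as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name kr_count is made up here, the base is 256 (bytes as digits) instead of the c = 20 of the slide, and q is the same prime 33554393. The window hash is kept in a 64-bit variable so that the intermediate products cannot overflow, and q is added before the subtraction so the value never goes negative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KR_BASE 256u         /* assumption: every byte is one base-256 digit */
#define KR_MOD  33554393u    /* the prime q used on the slide */

/* Count the occurrences of pat in text with a rolling hash. */
int kr_count(const char *text, const char *pat)
{
    assert(text != NULL && pat != NULL);
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n) return 0;

    uint64_t cm = 1;                          /* c^(m-1) mod q */
    for (size_t i = 1; i < m; i++)
        cm = (cm * KR_BASE) % KR_MOD;

    uint64_t hp = 0, hs = 0;                  /* H(P) and H(S[0..m-1]) */
    for (size_t i = 0; i < m; i++) {
        hp = (hp * KR_BASE + (unsigned char)pat[i]) % KR_MOD;
        hs = (hs * KR_BASE + (unsigned char)text[i]) % KR_MOD;
    }

    int count = 0;
    for (size_t i = 0; ; i++) {
        /* on a hash match, verify all m positions */
        if (hp == hs && memcmp(pat, text + i, m) == 0)
            count++;
        if (i + m >= n) break;                /* no further window fits */
        /* remove text[i], introduce text[i+m]; + KR_MOD keeps it non-negative */
        uint64_t out = ((unsigned char)text[i] * cm) % KR_MOD;
        hs = ((hs + KR_MOD - out) * KR_BASE + (unsigned char)text[i + m]) % KR_MOD;
    }
    return count;
}
```

The explicit memcmp on every hash match is what makes the result exact; it is also the source of the O(nm) worst case when many windows collide with the pattern hash.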


More ways to ensure O(n)? Shift-AND / Shift-OR

• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10. [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243]

Bit-operations

• Maintain a set of all prefixes that have so far had a perfect match
• On the next character in the text, update all previous pointers to a new set
• Bit vector: for every possible character

State: which prefixes match?

Move to next: shift 1, introduce 1, bitwise AND with Pattern[S[i]]

[Figure: the state bit vector is shifted by one, a 1 is introduced, and the result is ANDed with the vector Pattern[S[i]]]

Track positions of prefix matches

0 1 0 1 0 1      state
1 0 1 0 1 1      shift left <<, introduce 1
1 0 0 0 1 1      mask on char T[i]
1 0 0 0 1 1      result of bitwise AND


Vectors for every char in Σ

• P = aste

  Mask bit for each pattern position (rows) and each character of Σ (columns):

           a  s  t  e  b  c  d .. z
  1 (a)    1  0  0  0  0 ...
  2 (s)    0  1  0  0  0 ...
  3 (t)    0  0  1  0  0 ...
  4 (e)    0  0  0  1  0 ...

• T = lasteaed

  State after each text character (columns), one row per pattern position:

           l  a  s  t  e  a  e  d
  1 (a)    0  1  0  0  0  1
  2 (s)    0  0  1  0  0  0
  3 (t)    0  0  0  1  0  0
  4 (e)    0  0  0  0  1  0

  The 1 in row 4 under the first e marks the occurrence of aste.

http://www-igm.univ-mlv.fr/~lecroq/string/node6.html
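The bit-vector update can be sketched in C in a few lines. This is a minimal illustration, assuming the pattern fits into one 64-bit word; the function name shift_and_search is chosen here and is not taken from the slides or from the cited page. Bit j of mask[a] is set when pattern position j holds character a, and bit j of state is set when the pattern prefix of length j+1 ends at the current text position.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the 0-based start of the first occurrence of pat in text,
   or -1 if there is none. Assumption: 1 <= strlen(pat) <= 64. */
long shift_and_search(const char *text, const char *pat)
{
    assert(text != NULL && pat != NULL);
    size_t m = strlen(pat);
    if (m == 0 || m > 64) return -1;

    uint64_t mask[256] = {0};                 /* one bit vector per character */
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)pat[j]] |= (uint64_t)1 << j;

    uint64_t state = 0;
    uint64_t accept = (uint64_t)1 << (m - 1); /* bit of the full pattern */

    for (size_t i = 0; text[i] != '\0'; i++) {
        /* shift, introduce 1, AND with the mask of the current character */
        state = ((state << 1) | 1) & mask[(unsigned char)text[i]];
        if (state & accept)
            return (long)(i + 1 - m);
    }
    return -1;
}
```

For P = aste and T = lasteaed the accept bit is first set on the e at index 4, so the function returns 1, the index of the a that starts the occurrence.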


[A]11010101

Summary

Algorithm             Worst case   Ave. case            Preprocess
Brute force           O(mn)        O(n*(1+1/|Σ|+..))
Knuth-Morris-Pratt    O(n)         O(n)                 O(m)
Rabin-Karp            O(mn)        O(n)                 O(m)
Boyer-Moore                        O(n/m)?
BM Horspool
Factor search
Shift-OR              O(n)         O(n)                 O(m|Σ|)

• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf

Find occurrences in text

• Have we missed anything?

[Figure: pattern P aligned under text S]


Find occurrences in text

• What have we learned if we test for a potential match from the end?

[Figure: pattern P compared with S starting from its right end; strings shown: ABCDEBBCDE and AB]

Bad character heuristics: maximal shift on S[i]

[Figure: the text character S[i] = X is matched against the first X in the pattern, counted from the end]

delta1(S[i]) =
    m        if the pattern does not contain S[i]
    m - j    otherwise, for the max j such that P[j] == S[i]

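/* occ[a] = last position of character a in the pattern p[0..m-1], or -1.
   occ, p, m and alphabetsize are globals defined elsewhere. */
void bmInitocc()
{
    int a;      /* int, so the loop also ends when alphabetsize exceeds CHAR_MAX */
    int j;

    for (a = 0; a < alphabetsize; a++)
        occ[a] = -1;

    for (j = 0; j < m; j++) {
        a = (unsigned char) p[j];
        occ[a] = j;
    }
}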

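Good suffix heuristics

[Figure: matched suffix µ of P aligned with S. Case 1: µ occurs again further left in P. Case 2: only a suffix µ' of µ is a prefix of P.]

delta2(j): minimal shift so that
1. the matched region is fully covered, or
2. a suffix of the match is also a prefix of P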


Boyer-Moore algorithm

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

preprocess_BM()                      // delta1 and delta2
i = m
while i <= n
    for (j = m; j > 0 and P[j] == S[i-m+j]; j--) ;
    if j == 0 report match at position i-m+1
    i = i + max(delta1[S[i]], delta2[j])

• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/

Simplifications of BM

• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear.
• The algorithm can be made faster while the code becomes simpler.
• It is useful to use the last character heuristics (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).

Algorithm BMH (Boyer-Moore-Horspool)

• R. N. Horspool: Practical Fast Searching in Strings. Software - Practice and Experience, 10(6):501-506, 1980

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

for a in Σ do delta[a] = m
for j = 1..m-1 do delta[P[j]] = m-j
i = m
while i <= n
    if S[i] == P[m]
        j = m-1
        while (j > 0 and P[j] == S[i-m+j]) j = j-1
        if j == 0 report match at i-m+1
    i = i + delta[S[i]]
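The BMH pseudocode translates to C with 0-based indices as sketched below. The function name horspool_count is an assumption made for this example, and the delta table covers all byte values rather than a problem-specific alphabet. The window is compared from right to left, and the shift is always taken from the text character under the last pattern position.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Count the occurrences of pat in text with the Horspool shift rule. */
int horspool_count(const char *text, const char *pat)
{
    assert(text != NULL && pat != NULL);
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n) return 0;

    size_t delta[UCHAR_MAX + 1];
    for (size_t a = 0; a <= UCHAR_MAX; a++)
        delta[a] = m;                         /* character not in the pattern */
    for (size_t j = 0; j + 1 < m; j++)        /* last pattern char is excluded */
        delta[(unsigned char)pat[j]] = m - 1 - j;

    int count = 0;
    size_t i = m - 1;                         /* text index under pat[m-1] */
    while (i < n) {
        size_t j = m;                         /* compare right to left */
        while (j > 0 && pat[j - 1] == text[i + j - m])
            j--;
        if (j == 0)
            count++;                          /* match starts at i + 1 - m */
        i += delta[(unsigned char)text[i]];   /* shift on the last window char */
    }
    return count;
}
```

Since every delta value is at least 1, the window always moves forward; for a character that does not occur in the pattern it jumps by the full pattern length m.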

String Matching: Horspool algorithm

• How is the comparison made?
  – From right to left: suffix search
• Which is the next position of the window?
  – It depends on where the last letter of the text window, say 'a', appears in the pattern.
  – A preprocessing step is therefore needed to determine the length of the shift.

[Figure: pattern under the text window; the window is shifted so that the rightmost 'a' of the pattern lines up with the 'a' in the text]

Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)

• Use delta in a tight loop
• If match (delta == 0) then check and apply the original delta d

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

1.  for a in Σ do delta[a] = m
2.  for j = 1..m-1 do delta[P[j]] = m-j
3.  d = delta[P[m]];               // memorize d on P[m]
4.  delta[P[m]] = 0;               // ensure delta on match of last char is 0
5.  for (i = m; i <= n; i = i+d)
6.      repeat                     // skip loop
7.          t = delta[S[i]]; i = i + t
8.      until t == 0
9.      for (j = m-1; j > 0 and P[j] == S[i-m+j]; j = j-1) ;
10.     if j == 0 report match at i-m+1

BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly: at least one occurrence!).


• Daniel M. Sunday: A very fast substring search algorithm. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling:
  – Avoid too many loops (each loop requires tests) by just repeating code within the loop.
  – Line 7 in the previous algorithm can be replaced by:

    7. i += delta[S[i]];
       i += delta[S[i]];
       i += (t = delta[S[i]]);

Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm

• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html

Factor based approach

• Optimal average-case algorithms
  – Assuming independent characters, same probability
• Factor: a substring of a pattern
  – Any substring
  – (how many?)

Factor searches

Do not compare characters, but find the longest match to any subregion of the pattern.

[Figure: window of S over P; the longest factor u of the pattern is read from the right, preceded by a character X that breaks it]


Examples

• Backward DAWG Matching (BDM)
  – Crochemore et al 1994
• Backward Nondeterministic DAWG Matching (BNDM)
  – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM)
  – Allauzen, Crochemore, Raffinot 2001

Backward DAWG Matching (BDM)

• Do not compare characters, but find the longest match to any subregion of the pattern.
• The suffix automaton recognises all factors (and suffixes) in O(n)

BNDM: simulate using bit parallelism

• Bits show where the factors have occurred so far
• BNDM matches an NDA (nondeterministic automaton)

[Figure: NDA on the suffixes of 'announce']

[Figure: deterministic version of the same, the Backward Factor Oracle]

• BNDM: Backward Non-Deterministic DAWG Matching
• BOM: Backward Oracle Matching


String Matching of one pattern

CTACTACTACGTCTATACTGATCGTAGCTACTACGGTATGACTAA

[Figure: three ways of reading the search window over this text: factor search, prefix search, suffix search]

Multiple patterns

[Figure: a set of patterns {P} searched in text S]

Why?

• Multiple patterns
• Highlight multiple different search words on the page
• Virus detection: filter for virus signatures
• Spam filters
• Scanner in compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to the genome!
• …

Algorithms

• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber
  – Improvement over C-W
• Additional methods, tricks and techniques

Aho-Corasick (AC)

• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.

• Generalization of KMP for many patterns
• Text S like before.
• Set of patterns P = {P1, .., Pk}
• Total length |P| = m = Σ i=1..k mi
• Problem: find all occurrences of any of the Pi ∈ P from S


Idea

1. Create an automaton from all patterns
2. Match the automaton

• Use the PATRICIA trie for creating the main structure of the automaton

PATRICIA trie

• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.

• Word trie: a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree."
• To test for a word p, only O(|p|) time is used no matter how many words are in the dictionary...

Trie for P = {he, she, his, hers}

0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
       1 -i-> 6 -s-> 7
0 -s-> 3 -h-> 4 -e-> 5


How to search for words like he, sheila, hi. Do these occur in the trie?

[Figure: the trie for P = {he, she, his, hers}]

Aho-Corasick

1. Create an automaton MP for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] - read a character c from the text and go from state j to state goto[j,c].
4. If there are no goto[j,c] links on character c from state j, use fail[j].
5. Report the output. Report all words that have been found in state j.

AC Automaton (vs KMP)

[Figure: the trie for P = {he, she, his, hers} with fail links added; state 0 loops on every character that is NOT in {h, s}]

goto[1,i] = 6;
fail[7] = 3, fail[8] = 0, fail[5] = 2.

Output table

state    output[j]
2        he
5        she, he
7        his
9        hers

AC - matching

Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)

state = 0
for i = 1..n do
    while (goto[state,S[i]] == ∅) and (fail[state] != state) do
        state = fail[state]
    state = goto[state,S[i]]
    if (output[state] not empty)
        then report matches output[state] at position i

Algorithm Aho-Corasick preprocessing I (TRIE)

Input: P = {P1, ..., Pk}
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.

procedure enter(a1, ..., am)         /* Pi = a1, ..., am */
begin
    s = 0; j = 1;
    while goto[s,aj] ≠ ∅ do          // follow existing path
        s = goto[s,aj];
        j = j+1;
    for p = j to m do                // add new path (states)
        news = news+1;
        goto[s,ap] = news;
        s = news;
    output[s] = a1, ..., am
end

begin
    news = 0
    for i = 1 to k do enter(Pi)
    for a ∈ Σ do
        if goto[0,a] = ∅ then goto[0,a] = 0;
end

Preprocessing II for AC (FAIL)

queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue(queue, goto[0,a])
        fail[goto[0,a]] = 0
while queue ≠ ∅
    r = take(queue)
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[r,a]
            enqueue(queue, s)                // breadth first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[fail[s]]
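The two preprocessing phases and the matching loop fit into one small C sketch. Everything named here (ac_build, ac_scan, the AC_ constants) is hypothetical, and the simplifications are deliberate: the alphabet is assumed to be 'a'..'z', the goto function is a full table, at most 32 patterns are supported, and output[] is a bitmask in which bit i stands for pattern i. States are numbered in creation order, so for he, she, his, hers the numbering agrees with the trie drawn above.

```c
#include <assert.h>
#include <string.h>

#define AC_SIGMA     26      /* assumption: text and patterns use 'a'..'z' only */
#define AC_MAXSTATES 256     /* assumption: total pattern length below 256 */

int ac_goto[AC_MAXSTATES][AC_SIGMA];   /* -1 means "not defined" */
int ac_fail[AC_MAXSTATES];
unsigned ac_out[AC_MAXSTATES];         /* bit i set: pattern i ends in this state */

/* Build goto, fail and output for k patterns (k <= 32). */
void ac_build(const char *const pats[], int k)
{
    assert(k >= 0 && k <= 32);
    memset(ac_goto, -1, sizeof ac_goto);
    memset(ac_fail, 0, sizeof ac_fail);
    memset(ac_out, 0, sizeof ac_out);
    int news = 0;

    for (int i = 0; i < k; i++) {               /* phase I: enter(Pi) */
        int s = 0;
        for (const char *p = pats[i]; *p; p++) {
            int a = *p - 'a';
            if (ac_goto[s][a] < 0) {
                assert(news + 1 < AC_MAXSTATES);
                ac_goto[s][a] = ++news;         /* add a new state */
            }
            s = ac_goto[s][a];
        }
        ac_out[s] |= 1u << i;
    }
    for (int a = 0; a < AC_SIGMA; a++)          /* state 0 loops on the rest */
        if (ac_goto[0][a] < 0) ac_goto[0][a] = 0;

    int queue[AC_MAXSTATES], head = 0, tail = 0;
    for (int a = 0; a < AC_SIGMA; a++)          /* phase II: breadth first */
        if (ac_goto[0][a] != 0) queue[tail++] = ac_goto[0][a];
    while (head < tail) {
        int r = queue[head++];
        for (int a = 0; a < AC_SIGMA; a++) {
            int s = ac_goto[r][a];
            if (s < 0) continue;
            queue[tail++] = s;
            int state = ac_fail[r];
            while (ac_goto[state][a] < 0) state = ac_fail[state];
            ac_fail[s] = ac_goto[state][a];
            ac_out[s] |= ac_out[ac_fail[s]];    /* inherit outputs via fail */
        }
    }
}

/* Scan text once; return the bitmask of all patterns that occur in it. */
unsigned ac_scan(const char *text)
{
    unsigned found = 0;
    int state = 0;
    for (const char *t = text; *t; t++) {
        int a = *t - 'a';
        while (ac_goto[state][a] < 0) state = ac_fail[state];
        state = ac_goto[state][a];
        found |= ac_out[state];
    }
    return found;
}
```

With the patterns in the order he, she, his, hers, scanning ushers returns the bits for he, she and hers. A version that reports positions would record the text index at the point where ac_out[state] is non-zero instead of only accumulating the mask.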


Correctness

• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the longest suffix that is also a prefix of some word in P.
• Look at the article...

AC matching time complexity

• Theorem: For matching the MP on text S, |S| = n, less than 2n transitions within M are made.
• Proof: Compare to KMP.
  – There are at most n goto steps.
  – There cannot be more than n fail steps.
  – In total there can be less than 2n transitions in M.

Individual node (goto)

• Full table
• List
• Binary search tree (?)
• Some other index?

AC thoughts

• Scales for many strings simultaneously.
• For very many patterns, search time (of grep) improves (??)
  – See Wu-Manber article
• When k grows, then more fail[] transitions are made (why?)
• But always less than n.
• If all goto[j,a] are indexed in an array, then the size is |MP|*|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing transition functions.
• Then, O(n log(min(k,c))).

Advanced AC

• Precalculate the next state transition correctly for every possible character in the alphabet
• Can be good for short patterns

Problems of AC?

• Need to rebuild on adding/removing patterns
• Details of branching on each node (?)


Commentz-Walter

• Generalization of Boyer-Moore for multiple sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.fh-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm String Matching Fast On The Average (PDF, ~17,2 MB) or here String Matching Fast On The Average (extended abstract) (PDF, ~3 MB)

C-W description

• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.

Commentz-Walter [CW79]

• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.

Idea of C-W

• Build a backward trie of all keywords
• Match from the end until mismatch...
• Determine the shift based on the combination of heuristics

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns
2. lmin = 4
3. Table of shifts:

   A   1
   C   4 (lmin)
   G   2
   T   1

4. Start the search in the text ACATGCTATGTGACA…

[Figure: the trie of the inverted patterns, stepped along the text over several slides using the shift table]

Short shifts!

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


What are the possible limitations for C-W?

• Many patterns, small alphabet: minimal skips
• What can be done differently?

Wu-Manber

• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.

Key idea

• The main problem with Boyer-Moore and many patterns is that the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, we consider them in blocks of size B.
• A good value of B is in the order of $\log_c 2M$, where M is the total size of all patterns; in practice, we use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.

Shift table for blocks of 2 symbols:

   AA   1
   AC   3 (LMIN-L+1)
   AG   3
   AT   1
   CA   3
   CC   3
   CG   3
   …

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers, for the patterns ATGTATG, TATG, ATAAT, ATGTG:

   1 symbol              2 symbols
   A   1                 AA   1
   C   4 (lmin)          AT   1
   G   2                 GT   1
   T   1                 TA   2
                         TG   2

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
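Both tables above can be reproduced by one small function, sketched here in C. The function block_shift and its parameters are hypothetical, and the convention is read off the slide tables rather than from a published implementation: every pattern contributes its last lmin characters, a block ending at position e of that window gives the shift lmin - e, the block that ends the window is skipped (as in Horspool), and a block that never occurs gets the default lmin - B + 1.

```c
#include <assert.h>
#include <string.h>

/* Shift for a block of B characters, given k patterns that are all at
   least lmin characters long. */
int block_shift(const char *const pats[], int k, int lmin,
                const char *block, int B)
{
    assert(B >= 1 && B <= lmin && strlen(block) == (size_t)B);
    int shift = lmin - B + 1;                 /* block occurs in no window */

    for (int p = 0; p < k; p++) {
        /* w points at the last lmin characters of pattern p */
        const char *w = pats[p] + strlen(pats[p]) - (size_t)lmin;
        /* e is the 1-based end position of the block inside w;
           e == lmin (the final block) is left out on purpose */
        for (int e = B; e < lmin; e++) {
            if (strncmp(w + e - B, block, (size_t)B) == 0 && lmin - e < shift)
                shift = lmin - e;
        }
    }
    return shift;
}
```

With B = 1 this gives the single-symbol table (A 1, C 4, G 2, T 1), and with B = 2 the 2-mer table (AA 1, AT 1, GT 1, TA 2, TG 2, every other block 3). Larger blocks occur less often in the patterns, so more blocks receive the long default shift.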

Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA

   AA   1
   AT   1
   GT   1
   TA   2
   TG   2

[Figure: the trie of the inverted patterns]

Experimental length: $\log_{|\Sigma|}(2 \cdot lmin \cdot r)$

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


Backward Oracle

• Set Backwards oracle: SBDM, SBOM
• Pages 68-72

String matching of many patterns

[Figure: four panels, for 5, 10, 100 and 1000 patterns (strings). Horizontal axis: Lmin, from 5 to 45. Vertical axis: |Σ|, with ticks 2, 4, 8. The panels carry the labels Wu-Manber and SBOM; in the last panel only SBOM is labelled.]

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Factor Oracle

[Figures: Factor Oracle; Factor Oracle: safe shift; Factor Oracle: shift to match prefix of P2?]


Construction of factor Oracle

Factor oracle

• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps

So far

• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?

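Multiple Shift-AND

• P = {P1, P2, P3, P4}. Generalize Shift-AND
• Bits  = P1 P2 P3 P4 (the patterns laid out in one bit vector)
• Start = 1 1 1 1 (one bit per pattern)
• Match = 1 1 1 1 (one bit per pattern)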