CSC 212 – Data Structures Lecture 36: Pattern Matching


Page 1

CSC 212 – Data Structures

Lecture 36: Pattern Matching

Page 2

Suffixes and Prefixes

“I am the Lizard King!”

Prefixes:
  “I”
  “I ”
  “I a”
  “I am”
  …
  “I am the Lizard Kin”
  “I am the Lizard King”
  “I am the Lizard King!”

Suffixes:
  “!”
  “g!”
  “ng!”
  “ing!”
  …
  “am the Lizard King!”
  “ am the Lizard King!”
  “I am the Lizard King!”
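
As a small illustration of the two lists above, here is a minimal Python sketch that enumerates every non-empty prefix and suffix of a string. The function names `prefixes` and `suffixes` are chosen for this example and do not come from the slides.

```python
def prefixes(s):
    """Return every non-empty prefix of s, shortest first."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s):
    """Return every non-empty suffix of s, shortest first."""
    return [s[-k:] for k in range(1, len(s) + 1)]


text = "I am the Lizard King!"
print(prefixes(text)[:4])  # ['I', 'I ', 'I a', 'I am']
print(suffixes(text)[:4])  # ['!', 'g!', 'ng!', 'ing!']
```

The last entry of each list is the whole string; a proper prefix or suffix excludes that last entry.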

Page 3

KMP Algorithm

Asymptotically optimal algorithm
  Means cannot do better in big-Oh terms

Compares from left-to-right
  So like BruteForce, not Boyer-Moore
  But shifts pattern intelligently

Relies on a Key Insight™
  Preprocess pattern to avoid redundant comparisons
  Always go forward; never, ever look back

Page 4

The KMP Algorithm

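[Figure: a text containing ". . a b a a b x . . . . ." with the pattern P = "a b a a b a" aligned beneath it; the mismatch occurs at rank j of P, against x. A second copy of P is drawn shifted to the right. Annotations: "Do not repeat these comparisons", "Need to resume comparing here", and "Shifting P here ensures these two entries match".]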

Page 5

KMP Failure Function

Assume P[j] ≠ T[k]
  Need the rank in P to compare next against T[k]
  E.g., how should we shift P after a miss?

Uses the failure function, F(j-1)
  One value defined for each rank in P
  Specifies the rank in P at which comparisons must restart
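
To make the restart rule concrete, here is a minimal Python sketch. It assumes the failure function is stored in a list `F`, and it uses the values F(0..4) from the table for P = "abaaba" given later in these slides. The helper name `restart_rank` is made up for this example.

```python
def restart_rank(F, j):
    """After P[j] != T[k] with j > 0, return the rank in P to compare with T[k] next."""
    return F[j - 1]


F = [0, 0, 1, 1, 2]  # F(0)..F(4) for P = "abaaba", taken from the slide table
j = 5                # mismatch at the last character of P

new_j = restart_rank(F, j)  # rank where comparisons resume
shift = j - new_j           # how far P slides to the right
print(new_j, shift)         # 2 3
```

A mismatch at rank 5 resumes at rank 2, which amounts to sliding P three positions to the right without touching earlier text characters again.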

Page 6

Computing Failure Function

For rank j, F(j) is the length of the longest proper prefix of P[0...j] that is also a suffix of P[0...j]
  For speed, store failure function in an array
  Unlike Boyer-Moore, works w/infinite alphabets

Takes at most O(2m) = O(m) time
  The algorithm that computes the failure function is similar to KMP itself
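
For reference, the definition can be turned into code directly. The sketch below is a deliberately slow, definition-based version written for this example (it is not the O(m) algorithm from the slides, and the name `failure_function_naive` is hypothetical); it is handy for checking a faster implementation.

```python
def failure_function_naive(P):
    """For each rank j, length of the longest proper prefix of P[0..j]
    that is also a suffix of P[0..j]. Straight from the definition."""
    F = []
    for j in range(len(P)):
        sub = P[: j + 1]
        best = 0
        for k in range(1, len(sub)):  # k < len(sub) keeps the prefix proper
            if sub[:k] == sub[-k:]:
                best = k
        F.append(best)
    return F


print(failure_function_naive("abaaba"))  # [0, 0, 1, 1, 2, 3]
```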

Page 7

Computing Failure Function

Algorithm KMPFailureFunction(String P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < P.length()
    if P[i] = P[j]       // So, P[0…j] = P[i - j…i]
      F[i] ← j + 1       // Record the length of this prefix/suffix
      i ← i + 1          // Advance a character and see if still matches
      j ← j + 1
    else if j > 0        // No match, need to restart our computation
      j ← F[j - 1]       // Skip over longest prefix that is also a suffix
    else
      F[i] ← 0           // No proper prefix of P[0…i] is a suffix of P[0…i]
      i ← i + 1          // Move to the next character
  return F
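
The pseudocode above translates almost line for line into Python. This is a sketch rather than course-provided code; the one deviation, labeled in the comments, is that an empty pattern yields an empty list instead of failing on F[0].

```python
def kmp_failure_function(P):
    """Return the list F where F[i] is the length of the longest proper
    prefix of P[0..i] that is also a suffix of P[0..i]."""
    m = len(P)
    F = [0] * m          # assumption: empty P simply gives an empty list
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:          # P[0..j] == P[i-j..i]
            F[i] = j + 1          # record the length of this prefix/suffix
            i += 1
            j += 1
        elif j > 0:               # no match: fall back to a shorter prefix
            j = F[j - 1]
        else:                     # nothing matches at this rank
            F[i] = 0
            i += 1
    return F


print(kmp_failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```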

Page 8

KMP Failure Function

j     0  1  2  3  4  5
P[j]  a  b  a  a  b  a
F(j)  0  0  1  1  2  3

Page 9

The KMP Algorithm

Algorithm KMPMatch(String T, String P)
  F ← KMPFailureFunction(P)
  i ← 0
  j ← 0
  while i < T.length()
    if P[j] = T[i]       // So, P[0…j] = T[i - j…i]
      if j = P.length() - 1
        return i - j     // Found P; it starts at rank i - j of T
      i ← i + 1          // Advance and see if still a match
      j ← j + 1
    else if j > 0        // No match, but a prefix of P[0…j-1] matches
      j ← F[j - 1]       // So skip past longest prefix that is a suffix
    else
      i ← i + 1          // Nothing to reuse, move to the next character
  return -1              // P does not occur in T
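
A runnable Python sketch of the matcher above follows. Two choices here are assumptions made for the example rather than something the slides specify: a failed search returns -1, and an empty pattern is treated as matching at rank 0.

```python
def kmp_failure_function(P):
    """F[i] = length of the longest proper prefix of P[0..i] that is also its suffix."""
    F = [0] * len(P)
    i, j = 1, 0
    while i < len(P):
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return F


def kmp_match(T, P):
    """Return the rank in T where P first occurs, or -1 if it never does."""
    if not P:
        return 0                      # assumption: empty pattern matches at 0
    F = kmp_failure_function(P)
    i = j = 0
    while i < len(T):
        if P[j] == T[i]:
            if j == len(P) - 1:
                return i - j          # whole pattern matched
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]              # reuse the prefix already matched
        else:
            i += 1                    # nothing to reuse, move on
    return -1


print(kmp_match("I am the Lizard King!", "Lizard"))  # 9
```

Note that i never moves backward, which is the "never look back" property from the earlier slide.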

Page 10

Example

T = a b a c a a b a c c a b a c a b a a b b
P = a b a c a b

[Figure: P is drawn at five successive alignments beneath T, with the character comparisons numbered 1 through 19 in the order they are made.]

j     0  1  2  3  4  5
P[j]  a  b  a  c  a  b
F(j)  0  0  1  0  1  2
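
The comparison count in this example can be reproduced with an instrumented matcher. The sketch below assumes the example text is "abacaabaccabacabaabb" and the pattern is "abacab"; the function name `kmp_match_counting` is made up for this illustration.

```python
def kmp_failure_function(P):
    """F[i] = length of the longest proper prefix of P[0..i] that is also its suffix."""
    F = [0] * len(P)
    i, j = 1, 0
    while i < len(P):
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return F


def kmp_match_counting(T, P):
    """Run KMPMatch on a non-empty P; return (match rank or -1, comparisons made)."""
    F = kmp_failure_function(P)
    i = j = comparisons = 0
    while i < len(T):
        comparisons += 1              # one P[j] vs T[i] comparison per pass
        if P[j] == T[i]:
            if j == len(P) - 1:
                return i - j, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return -1, comparisons


print(kmp_match_counting("abacaabaccabacabaabb", "abacab"))  # (10, 19)
```

The match is found at rank 10 after 19 character comparisons, one per numbered comparison in the figure.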

Page 11

The KMP Algorithm

In each pass of KMPMatch, either:
  P[j] = T[i] ⇒ i increases by one, or
  P[j] ≠ T[i] & j > 0 ⇒ P shifted right by at least 1, or
  P[j] ≠ T[i] & j = 0 ⇒ i increases by 1

Neither i nor the shift i - j ever decreases, and neither can exceed n
  So at most 2n iterations of loop

KMPMatch takes O(2n) = O(n) time
KMPFailureFunction needs O(m) time
Thus, algorithm runs in O(m + n) time
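
The 2n bound on loop passes can also be checked by brute force on small inputs. This is only a sanity check under assumed limits (a two-letter alphabet, texts up to length 8, patterns up to length 4), not a proof; the names `kmp_passes` and `worst_ratio` are hypothetical.

```python
from itertools import product


def kmp_failure_function(P):
    """F[i] = length of the longest proper prefix of P[0..i] that is also its suffix."""
    F = [0] * len(P)
    i, j = 1, 0
    while i < len(P):
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return F


def kmp_passes(T, P):
    """Count the passes of the KMPMatch loop for text T and non-empty pattern P."""
    F = kmp_failure_function(P)
    i = j = passes = 0
    while i < len(T):
        passes += 1
        if P[j] == T[i]:
            if j == len(P) - 1:
                break                 # match found, loop ends
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return passes


def worst_ratio(max_n=8, max_m=4, alphabet="ab"):
    """Largest passes / n seen over every text and pattern within the limits."""
    worst = 0.0
    for n in range(1, max_n + 1):
        for m in range(1, max_m + 1):
            for T in product(alphabet, repeat=n):
                for P in product(alphabet, repeat=m):
                    worst = max(worst, kmp_passes(T, P) / n)
    return worst


print(worst_ratio() <= 2.0)  # True
```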

Page 12

Your Turn

Get back into groups and do activity

Page 13

Before Next Lecture…

Finish up assignments
Start thinking about questions for Final