05/18/22 1 LCS

10/12/20141LCS. Given two strings S 1 of length m and S 2 of length n over the same alphabeth. The Longest Common Substring problem is to find the longest

Given two strings S1 of length m and S2 of length n over the same alphabet, the Longest Common Substring problem is to find the longest substring of S1 that is also a substring of S2.

A generalization is the k-common substring problem. Given the set of strings S = {S1, S2, …, SK}, where |Si| = ni and Σ ni = N, find for each 2 ≤ k ≤ K the longest string that occurs as a substring of at least k of the strings.
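The generalization can be illustrated with a deliberately naive sketch. It assumes the problem is read as "occurs in at least k of the strings", and the function name `k_common_substrings` is hypothetical. It enumerates every substring, so it is only suitable for short inputs; a generalized suffix tree would be the efficient route.

```python
from collections import defaultdict


def k_common_substrings(strings):
    """For each k in 2..K, return one longest substring found in at least k strings."""
    # Map every distinct substring to the indices of the strings containing it.
    owners = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)

    best = {k: "" for k in range(2, len(strings) + 1)}
    for sub, ids in owners.items():
        # A substring shared by len(ids) strings qualifies for every k up to that count.
        for k in range(2, len(ids) + 1):
            if len(sub) > len(best[k]):
                best[k] = sub
    return best


if __name__ == "__main__":
    print(k_common_substrings(["ABABC", "BABCA", "ABCBA"]))
```

When several substrings tie for the longest, this sketch keeps the first one it encounters.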

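Longest Common Substring

Dynamic programming table for the strings ABAB (columns, index j) and BABA (rows, index i). Each cell holds the length of the longest common suffix of the two prefixes that end at that cell:

      A  B  A  B
   0  0  0  0  0
B  0  0  1  0  1
A  0  1  0  2  0
B  0  0  2  0  3
A  0  1  0  3  0

The largest entry is 3, so the longest common substrings are ABA and BAB.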
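The table can be reproduced with a short dynamic programming routine. This is a minimal sketch, not the presenter's implementation; the name `longest_common_substring` is an assumption. A cell is one more than its upper-left neighbour when the two characters match, and 0 otherwise.

```python
def longest_common_substring(s1, s2):
    """Return (length, substring) for one longest common substring of s1 and s2."""
    m, n = len(s1), len(s2)
    # table[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best_len, best_end = 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    # Remember where the best run ends in s1.
                    best_len, best_end = table[i][j], i
    return best_len, s1[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_substring("BABA", "ABAB"))  # (3, 'BAB')
```

Both time and memory are O(m n) here, because the full table is kept.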

[Figure: generalized suffix tree for two strings over {a, b, c}, terminated by $ and #]

Common Sub-Strings: ‘a’, ‘b’, ‘c’, ‘ab’, ‘bc’

Longest Common Sub-Strings: ‘ab’, ‘bc’

LCS compares two strings and finds the longest run of characters that occurs in both of them.

We can then declare two documents near duplicates if the ratio of the common substring length to the document length exceeds some threshold.

Consider the example below:
Selling a beautiful house in California.
Buying a beautiful chip in California.

The longest common substring is "ing a beautiful " (it is 16 characters long, whereas " in California." comes in second at 15 characters long). The first string is 40 characters long. So, you could assess how similar the strings are by taking the ratio: 16/40 = 0.4.
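The ratio test can be sketched with Python's standard `difflib` module, whose `SequenceMatcher.find_longest_match` returns the longest matching block of two sequences. The helper names `lcs_ratio` and `near_duplicates` and the threshold value of 0.3 are assumptions made for illustration; the ratio divides by the length of the first string, as in the example.

```python
from difflib import SequenceMatcher


def lcs_ratio(a, b):
    """Length of the longest common substring divided by the length of a."""
    if not a:
        return 0.0
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / len(a)


def near_duplicates(a, b, threshold):
    """True when the ratio exceeds the chosen threshold."""
    return lcs_ratio(a, b) > threshold


if __name__ == "__main__":
    first = "Selling a beautiful house in California."
    second = "Buying a beautiful chip in California."
    print(lcs_ratio(first, second))
    print(near_duplicates(first, second, threshold=0.3))
```

Dividing by the length of the first string makes the measure asymmetric; dividing by the shorter or the longer of the two lengths would be an equally reasonable choice.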

The best part of this application is that the user can set the threshold level interactively.

Target audience of this application:
* Universities that do not have access to Turnitin.
* Students who do not have access to Turnitin.

Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources.

With exact matching, two corresponding fields within a record are said to agree only if all characters match; otherwise the fields are considered mismatches. An approximate comparator such as the LCS score gives partial credit to fields that nearly match.

LCS score for the names ‘TAMMY SHACKELFORD’ and ‘TAMMIE SHACKLEFORD’: the total length of the common substrings is 5 (SHACK) + 4 (TAMM) + 4 (FORD) = 13.

The length of the shorter name string (ignoring white space) is 16, therefore the LCS score is 13 ÷ 16 = 0.8125.
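One way to arrive at this score is sketched here, under two assumptions that the slide does not state: the common substrings are collected by repeatedly finding and removing the longest common substring, and the process stops once the longest remaining match is shorter than `min_len` characters (set to 3 here because that reproduces the total of 13). The names `lcs_score` and `_longest_common` are hypothetical.

```python
def _longest_common(a, b):
    """Return one longest common substring of a and b (two-row DP)."""
    prev = [0] * (len(b) + 1)
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_score(name1, name2, min_len=3):
    """Total length of removed common substrings over the shorter name length."""
    a = "".join(name1.split())  # ignore white space
    b = "".join(name2.split())
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    total = 0
    while True:
        common = _longest_common(a, b)
        if len(common) < min_len:  # assumed cutoff, not given on the slide
            break
        total += len(common)
        # Remove the first occurrence from both names and look again.
        a = a.replace(common, "", 1)
        b = b.replace(common, "", 1)
    return total / shorter


if __name__ == "__main__":
    print(lcs_score("TAMMY SHACKELFORD", "TAMMIE SHACKLEFORD"))  # 0.8125
```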

Approach              Worst-Case Time Complexity
Brute Force           O(n^3)
Dynamic Programming   O(m n)
Suffix Array          O(n log n)
Suffix Tree           O(n)
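For comparison with the faster approaches in the table, the brute-force row can be sketched as follows. This is one plausible reading of "brute force" (try every pair of start positions and extend the match), and the name `lcs_brute_force` is an assumption. With two strings of length n it performs on the order of n^3 character comparisons in the worst case.

```python
def lcs_brute_force(s1, s2):
    """Return one longest common substring by trying every pair of start positions."""
    best = ""
    for i in range(len(s1)):
        for j in range(len(s2)):
            # Extend the match starting at s1[i] and s2[j] as far as it goes.
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            if k > len(best):
                best = s1[i:i + k]
    return best


if __name__ == "__main__":
    print(lcs_brute_force("ABAB", "BABA"))  # ABA
```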

Approach              Time Complexity (Basic Operations)   Execution Time (ms)
Brute Force           n^3                                   129
Dynamic Programming   m*n                                   68
Suffix Tree           n                                     29

Approach              Time Complexity (Basic Operations)   Execution Time (ms)
Brute Force           n^3                                   273
Dynamic Programming   m*n                                   120
Suffix Tree           n                                     60

Results were measured on an Intel Core i7 2.00 GHz processor with 4 GB of RAM.

In dynamic programming, the following changes can be made to the existing algorithm to reduce the memory usage of an implementation:
* Keep only the last and current row of the dynamic programming table, which needs O(min(m, n)) memory instead of O(n m).
* Store only the non-zero values in the rows. This can be done using hash tables instead of arrays and is useful for large alphabets.
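Both memory-saving ideas can be sketched in a few lines. The function names `lcs_length_two_rows` and `lcs_length_sparse` are hypothetical, and both variants return only the length of the longest common substring. The first keeps two rows indexed by the shorter string; the second keeps each row in a dictionary that holds only the non-zero cells.

```python
def lcs_length_two_rows(s1, s2):
    """Length of the longest common substring using only two DP rows."""
    # Make s2 the shorter string so each row has min(m, n) + 1 entries.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = [0] * (len(s2) + 1)
    best = 0
    for ch in s1:
        curr = [0] * (len(s2) + 1)
        for j, other in enumerate(s2, start=1):
            if ch == other:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lcs_length_sparse(s1, s2):
    """Same result, but each row is a dict holding only non-zero cells."""
    # Index the positions of every character of s2 so only matches are visited.
    positions = {}
    for j, ch in enumerate(s2):
        positions.setdefault(ch, []).append(j)
    prev = {}
    best = 0
    for ch in s1:
        curr = {}
        for j in positions.get(ch, ()):
            # A missing key in prev stands for a zero cell.
            curr[j] = prev.get(j - 1, 0) + 1
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(lcs_length_two_rows("ABAB", "BABA"), lcs_length_sparse("ABAB", "BABA"))  # 3 3
```

The sparse variant pays off when few character pairs match, which is the large-alphabet case mentioned in the list.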

The existing suffix-tree implementation of the longest common substring problem, which uses Ukkonen's construction, can be modified to use McCreight's or Weiner's construction to see whether there is a marginal improvement in time and space complexity.

The performance of a hybrid algorithm that uses a rolling hash and suffix arrays can be compared with the existing performance results to see whether there is any significant change in time and space complexity.
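The rolling-hash half of such a hybrid could look like the sketch below. It is not the algorithm proposed on the slide, which is not specified; it is a common technique that binary-searches the answer length and uses a polynomial rolling hash to find candidate matches of that length. The function names, the base 257 and the modulus 2^61 - 1 are all assumptions. Every hash hit is confirmed by comparing the actual text, so collisions cannot produce a wrong answer.

```python
def _window_hashes(s, length, base, mod):
    """Yield (start, hash) for every window of `length` characters in s."""
    if length == 0 or length > len(s):
        return
    high = pow(base, length - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in s[:length]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for start in range(1, len(s) - length + 1):
        h = (h - ord(s[start - 1]) * high) % mod
        h = (h * base + ord(s[start + length - 1])) % mod
        yield start, h


def _common_of_length(s1, s2, length, base=257, mod=(1 << 61) - 1):
    """Return a common substring of exactly `length` characters, or None."""
    seen = {}
    for start, h in _window_hashes(s1, length, base, mod):
        seen.setdefault(h, []).append(start)
    for start, h in _window_hashes(s2, length, base, mod):
        candidate = s2[start:start + length]
        for other in seen.get(h, ()):
            if s1[other:other + length] == candidate:  # rule out hash collisions
                return candidate
    return None


def lcs_rolling_hash(s1, s2):
    """Binary search on the length; a match of length L implies one of length L - 1."""
    lo, hi = 0, min(len(s1), len(s2))
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = _common_of_length(s1, s2, mid)
        if found is not None:
            best, lo = found, mid
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    print(lcs_rolling_hash("ABAB", "BABA"))
```

Ignoring collisions, each length check is linear in m + n, and the binary search adds a logarithmic factor in min(m, n).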

On-line construction of suffix trees: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

Longest common substring problem: http://en.wikipedia.org/wiki/Longest_common_substring_problem#See_also

Generalized suffix tree: http://en.wikipedia.org/wiki/Generalized_suffix_tree

Real World Performance of Approximate String Comparators for use in Patient Matching: http://www.cs.mun.ca/~harold/Courses/Old/CS6772.F04/Diary/5604Grannis.pdf
