Approximate String Matching
A Guided Tour to Approximate String Matching
Gonzalo Navarro
Justin Wiseman

Outline:
Definition of approximate string matching (ASM)
Applications of ASM
Algorithms
Conclusion

Approximate string matching
Approximate string matching is the process of matching strings while allowing for errors.

The edit distance
Strings are compared based on how close they are
This closeness is called the edit distance
The edit distance is the number of operations required to transform one string into the other

Levenshtein / edit distance
Named after Vladimir Levenshtein, who introduced the distance in 1965
Accounts for three basic operations:
Insertions, deletions, and replacements
In the simplified version, all operations have a cost of 1
Example: mash and march have an edit distance of 2 (replace s with r, insert c)

Other distance algorithms
Hamming distance: allows only substitutions, with a cost of one each
Episode distance: allows only insertions, with a cost of one each
Longest Common Subsequence distance: allows only insertions and deletions, costing one each
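
The three restricted distances above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical (they do not come from the slides), and returning `None` when insertions alone cannot turn one string into the other is an assumption made for the example.

```python
def hamming_distance(x: str, y: str) -> int:
    """Substitutions only: count positions where x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, using a single DP row."""
    row = [0] * (len(y) + 1)
    for a in x:
        diag = 0  # previous row's value at column j - 1
        for j, b in enumerate(y, start=1):
            above = row[j]  # previous row's value at column j
            row[j] = diag + 1 if a == b else max(row[j], row[j - 1])
            diag = above
    return row[len(y)]


def lcs_distance(x: str, y: str) -> int:
    """Insertions and deletions only: characters outside a common subsequence."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def episode_distance(x: str, y: str):
    """Insertions only: defined when x is a subsequence of y (assumed None otherwise)."""
    remaining = iter(y)
    if all(ch in remaining for ch in x):  # consumes y left to right
        return len(y) - len(x)
    return None
```

Note that the episode distance is not symmetric: `episode_distance("mah", "march")` is defined, while swapping the arguments is not.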

Outline:
What is approximate string matching (ASM)?
What are the applications of ASM?
Algorithms
Conclusion

Applications
Computational biology
Signal processing
Information retrieval

Computational biology
DNA is composed of adenine, cytosine, guanine, and thymine (A, C, G, T)
One can think of the set {A, C, G, T} as the alphabet for DNA sequences
ASM is used to find specific or similar DNA sequences
Knowing how different two sequences are can give insight into the evolutionary process

Signal processing
Used heavily in speech recognition software
Error correction for received signals
Multimedia and song recognition

Information retrieval
Spell checkers
Search engines
Web searches (Google)
Personal files (agrep for Unix)
Searching texts with errors, such as digitized books
Handwriting recognition

Algorithms
Definitions
Dynamic Programming algorithms
Automatons
Bit-parallelism
Filters

Definitions
Let $\Sigma$ be a finite alphabet of size $|\Sigma| = \sigma$
Let $T \in \Sigma^*$ be a text of length $n = |T|$
Let $P \in \Sigma^*$ be a pattern of length $m = |P|$
Let $k \in \mathbb{R}$ be the maximum error allowed
Let $d : \Sigma^* \times \Sigma^* \to \mathbb{R}$ be a distance function
Therefore, given $T$, $P$, $k$, and $d(\cdot)$, return the set of all text positions $j$ such that there exists $i$ such that $d(P, T_{i..j}) \le k$

Dynamic programming
The oldest approach to solving the approximate string matching problem
Not very efficient: runtime of O(|x||y|)
However, space is only O(min(|x|, |y|))
Most flexible when adapting to different distance functions

Computing the edit distance
To compute the edit distance ed(x, y), create a matrix $C_{0..|x|,\,0..|y|}$ where $C_{i,j}$ represents the minimum number of operations needed to match $x_{1..i}$ to $y_{1..j}$
$C_{i,0} = i$
$C_{0,j} = j$
$C_{i,j} = C_{i-1,j-1}$ if $x_i = y_j$, else $1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$

Edit distance example
$C_{i,0} = i$, $C_{0,j} = j$; if $x_i = y_j$ then $C_{i,j} = C_{i-1,j-1}$, else $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$
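
The recurrence above translates almost directly into code. The following is a minimal Python sketch, assuming unit costs for all three operations; the function name `edit_distance` is hypothetical. It keeps only two rows of the matrix at a time, which is one way to reach the O(min(|x|, |y|)) space bound.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit costs, keeping two rows of C in memory."""
    if len(y) > len(x):
        x, y = y, x  # the distance is symmetric; lay the shorter string along the row
    prev = list(range(len(y) + 1))  # row 0: C[0][j] = j
    for i, xi in enumerate(x, start=1):
        curr = [i]  # column 0: C[i][0] = i
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1])  # characters agree: copy the diagonal
            else:
                # one more operation than the cheapest of top, left, diagonal
                curr.append(1 + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[len(y)]
```

For the strings used earlier, `edit_distance("mash", "march")` evaluates to 2.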

Text searching
The previous algorithm can be converted to search a text for a given pattern with few changes
Let x = pattern and y = text
Set $C_{0,j} = 0$ so that any text position can be the start of a match
$C_{i,j} = C_{i-1,j-1}$ if $P_i = T_j$, else $1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$
A match ending at text position $j$ is reported whenever $C_{m,j} \le k$

Text search example
In plain English: if the letters at the current indices are the same, the current cell takes the value of its top-left neighbor. If the letters differ, the current cell is one plus the minimum of its left, top, and top-left neighbors.
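
A minimal Python sketch of this search variant follows. It assumes unit costs and processes the text one column at a time, so only O(m) values are held in memory; the name `dp_search` and the choice to report 1-based end positions are assumptions made for the example.

```python
def dp_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based text positions j where a match with at most k errors ends."""
    m = len(pattern)
    col = list(range(m + 1))  # column 0: C[i][0] = i
    matches = []
    for j, tj in enumerate(text, start=1):
        new = [0]  # C[0][j] = 0: a match may start at any text position
        for i in range(1, m + 1):
            if pattern[i - 1] == tj:
                new.append(col[i - 1])  # top-left neighbor
            else:
                # left (col[i]), top (new[i - 1]), top-left (col[i - 1])
                new.append(1 + min(col[i], new[i - 1], col[i - 1]))
        col = new
        if col[m] <= k:  # last row holds the best error count ending at j
            matches.append(j)
    return matches
```

With `k = 0` the function degenerates to exact matching, which is a quick sanity check for the recurrence.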

Improvements
The example algorithm listed was the first
Many DP-based algorithms have since improved on the search time
In 1992, Chang and Lampe produced a new algorithm called column partitioning, with an average search time of $O(kn/\sqrt{\sigma})$, where k = number of errors, n = text length, and $\sigma$ = size of the alphabet

Automatons for approximate search
Model the search with a nondeterministic finite automaton (NFA)
1985: Esko Ukkonen proposes a deterministic form (DFA)
Fast: the deterministic form has O(n) worst-case search time
Large: the space complexity of the DFA grows exponentially with respect to the pattern length

NFA example with k = 2
Matching the pattern survey on the text surgery
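
To make the NFA model concrete, here is a small Python sketch that simulates such an automaton directly by tracking its set of active states. It is an illustration under assumptions, not the construction from the slides: each state is represented as a pair (pattern characters consumed, errors used), and the name `nfa_search` is hypothetical. Simulating the state set this way is slow; it only shows what the automaton accepts.

```python
def nfa_search(pattern: str, text: str, k: int) -> list[int]:
    """Simulate the approximate-matching NFA; return 1-based match end positions."""
    m = len(pattern)

    def closure(states):
        # Epsilon moves: skip a pattern character at the cost of one error (deletion).
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    matches = []
    for j, c in enumerate(text, start=1):
        nxt = {(0, 0)}  # self-loop on the initial state: a match can start anywhere
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))  # matching character, no new error
            if e < k:
                nxt.add((i, e + 1))  # extra text character (insertion)
                if i < m:
                    nxt.add((i + 1, e + 1))  # mismatching character (substitution)
        active = closure(nxt)
        if any(i == m for i, _ in active):  # some final state is active
            matches.append(j)
    return matches
```

Calling `nfa_search("survey", "surgery", 2)` runs the example from the slide; a match is reported at every text position where a state that has consumed the whole pattern is active.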

Improvements
In 1996, Kurtz proposes lazy construction of the DFA
Space requirements reduced to O(mn)

Bit-parallelism
Takes advantage of the inherent parallelism of a computer's bitwise operations on machine words
Changes an existing algorithm to operate at the bit level
The number of operations can be reduced by a factor of up to w, where w is the number of bits in a machine word

Shift-Or
Was the first bit-parallel algorithm
Parallelizes the operation of an NFA that tries to match the pattern exactly
The NFA has m+1 states
Builds a table B which stores a bit mask for every character c
For the mask B[c], the bit $b_i$ is set if and only if $P_i = c$
The search state is kept in a machine word $D = d_m \ldots d_1$
$d_i$ is 1 when $P_{1..i}$ matches the end of the text scanned so far
A match is registered when $d_m = 1$
To start, D is set to $1^m$ (Shift-Or stores the bits complemented, so a 0 marks an active state; under the uncomplemented convention used in the description above, the start value is $0^m$)
D is updated upon reading a new text character using the following formula
D ((D
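
The formula above is cut off in the source, and the sketch below does not claim to restore it. It is a minimal Python illustration of the same idea in its uncomplemented form (commonly called Shift-And), where a 1 bit means "this prefix of the pattern matches", as in the description of D above. The name `shift_and_search` is hypothetical, and Python's unbounded integers stand in for the machine word, so the sketch is not limited to m ≤ w.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Exact matching with bit-parallelism; returns 1-based match end positions."""
    m = len(pattern)
    if m == 0:
        return []
    # B[c]: bit i (counting from 0) is set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    match_bit = 1 << (m - 1)  # corresponds to d_m in the slide's notation
    d = 0  # no prefix matched yet
    matches = []
    for j, c in enumerate(text, start=1):
        # Shift every matched prefix forward by one, start a new attempt (| 1),
        # then keep only the prefixes whose next pattern character equals c.
        d = ((d << 1) | 1) & masks.get(c, 0)
        if d & match_bit:
            matches.append(j)
    return matches
```

Each text character costs a constant number of word operations, regardless of how many NFA states are active, which is where the speedup comes from.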