On the Use of Regular Expressions for Searching Text

Charles L.A. Clarke and Gordon V. Cormack

http://doi.acm.org/10.1145/256167.256174

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

Ricardo Baeza-Yates, Gaston H. Gonnet

http://doi.acm.org/10.1145/235809.235810

On the Use of Regular Expressions for Searching Text

• New perspective, particularly relevant to structured text

• Definition of the search problem
  – Does a given string of text match a particular pattern? (regular expression recognition problem)
  – Locate the substrings of a text that match a particular pattern (searching problem)
  – Given a universe U, identify all elements of U that contain a substring x matching a particular pattern r (more precise definition)

On the Use of Regular Expressions for Searching Text

• Given a string x and a regular expression r, locate all substrings of x that match r (continuous stream of text; problem: the number of matching substrings can be quadratic in the length of x, and the results may overlap and nest)
  – Restrict the search to linearize the solutions; not simple
  – The most common restriction is the "leftmost longest match" rule
  – Problems: what is the next match? Where should a new search start from?
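To make the quadratic growth concrete, here is a minimal brute-force sketch in Python. It is not the method of the article: it simply tests every nonempty substring of x against r. The function names and the use of Python's `re` module are assumptions made for illustration, and the "leftmost longest" pick below is computed from the full match list rather than taken from any particular regex engine.

```python
import re

def all_matching_substrings(x, r):
    """Return every (start, end) such that x[start:end] matches r in full.

    Brute force: checks all O(n^2) nonempty substrings of x.
    """
    pattern = re.compile(r)
    return [(i, j)
            for i in range(len(x))
            for j in range(i + 1, len(x) + 1)
            if pattern.fullmatch(x[i:j])]

def leftmost_longest(matches):
    """Pick the match with the smallest start, breaking ties by largest end."""
    return min(matches, key=lambda m: (m[0], -m[1]))

matches = all_matching_substrings("aaa", "a+")
print(matches)                    # six matches for a text of length 3
print(leftmost_longest(matches))  # (0, 3)
```

For the text "aaa" and the pattern `a+`, every one of the n(n+1)/2 = 6 nonempty substrings matches, and the matches both overlap and nest, which is why some linearizing restriction is needed.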

On the Use of Regular Expressions for Searching Text

• This article proposes an alternative linearizing restriction: "Locate the set of shortest nonnested (but possibly overlapping) strings that each match the pattern".

• Related work: Thompson's algorithm, Baeza-Yates

On the Use of Regular Expressions for Searching Text

• Shortest substring
  – Definition of the search problem
  – Comparison between longest and shortest match search:
    • Shortest-match reports all occurrences of the members of L that are in G(L) and no others; the result of longest-match depends on the entire text.
    • A string may be recognized as a member of a regular language by a single left-to-right scan with constant store. Longest-match search does not have these properties.
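The shortest nonnested rule can be illustrated with a small brute-force sketch in Python. This is an assumption-laden illustration, not the single-scan algorithm of the article: it enumerates all matching substrings and then discards any match that contains another match. The name `shortest_nonnested` is hypothetical.

```python
import re

def shortest_nonnested(x, r):
    """Return the (start, end) matches of r in x that contain no other match.

    Matches may overlap each other, but none is nested inside another.
    Brute force for illustration only.
    """
    pattern = re.compile(r)
    matches = [(i, j)
               for i in range(len(x))
               for j in range(i + 1, len(x) + 1)
               if pattern.fullmatch(x[i:j])]

    def contains_other(m):
        i, j = m
        return any((k, l) != m and i <= k and l <= j for k, l in matches)

    return [m for m in matches if not contains_other(m)]

# Overlapping matches are both kept.
print(shortest_nonnested("ababa", "aba"))    # [(0, 3), (2, 5)]
# Extending the text does not change the shortest match that was already found.
print(shortest_nonnested("abc", "a.*c"))     # [(0, 3)]
print(shortest_nonnested("abcxxc", "a.*c"))  # [(0, 3)]
```

In the last two calls, a leftmost-longest search would report "abc" for the first text but "abcxxc" for the second, so its answer depends on what follows; the shortest match is (0, 3) in both cases.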

On the Use of Regular Expressions for Searching Text

• Explicit containment
  – A regular expression may be used to define an explicit universe for search. Implement it by running two concurrent copies of the algorithm.

• Search tool: CGREP was developed on the basis of the theory in this article.
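Explicit containment can be sketched as follows, under stated assumptions: the universe is taken to be the shortest nonnested matches of one regular expression, and an element is reported when it contains a match of the query pattern. This brute-force Python version does not run two concurrent copies of a scanning algorithm as the slide describes, and it is not how CGREP works; all function names are hypothetical.

```python
import re

def matches(x, r):
    """All (start, end) with x[start:end] matching r in full (brute force)."""
    p = re.compile(r)
    return [(i, j)
            for i in range(len(x))
            for j in range(i + 1, len(x) + 1)
            if p.fullmatch(x[i:j])]

def shortest(ms):
    """Keep only the matches that contain no other match."""
    return [m for m in ms
            if not any(o != m and m[0] <= o[0] and o[1] <= m[1] for o in ms)]

def search_in_universe(x, universe_re, r):
    """Return the universe elements of x that contain a substring matching r."""
    universe = shortest(matches(x, universe_re))
    hits = matches(x, r)
    return [x[i:j] for i, j in universe
            if any(i <= k and l <= j for k, l in hits)]

text = "<p>cat</p><p>dog</p>"
print(search_in_universe(text, "<p>.*</p>", "dog"))  # ['<p>dog</p>']
```

Because the universe is built with the shortest-match rule, the pattern `<p>.*</p>` yields the two individual paragraphs rather than one span covering both, which is what makes it usable as a retrieval unit in structured text.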

On the Use of Regular Expressions for Searching Text

• Concluding comments:
  – Explores the properties of the shortest match search rule for regular expressions
  – The shortest substring rule provides a precise definition of which strings will be selected during a search without any dependence on the contents of the remainder of the text
  – A single left-to-right scan is enough
  – Storage requirements depend on the properties of the regular expression only
  – Can define search universes; useful in structured text (no predefined retrieval units)

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• Presents algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index.

• The algorithms run in logarithmic expected time in the size of the text for some restricted regular expressions, and in sublinear expected time for any regular expression.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• Pattern matching – find occurrences of a given pattern in a long string

• Variations based on preprocessing the text or not and the language used to specify the query

• In this article the authors consider preprocessed text and a query specified by a regular expression

• The problem: determine whether the text string t ∈ Σ* q Σ* (where q is the query) and report any combination of: 1) the location of an occurrence, 2) the number of occurrences, 3) all locations where the pattern occurs

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• Main idea: Simulation of the finite automaton of the query over a digital tree (or Patricia tree) of the text. Run the automaton on all paths of the digital tree from the root to the leaves, stopping when possible.

• Time savings come from the fact that each edge of the tree is traversed at most once, and that every edge represents pairs of symbols in many places of the text.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• Static databases

• Logical index for text

• Definition of sistrings

• Construction of text index which is a binary trie consisting of the set of sistrings of the text

• Use of Patricia tree to reduce the number of internal nodes
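As a rough illustration of the index described above, the sketch below treats each sistring (semi-infinite string) as the suffix of the text starting at a given position and inserts all of them into a trie. It is a simplification under stated assumptions: the trie is character-level rather than binary, it is not compressed into a Patricia tree, and the function names are hypothetical.

```python
def sistrings(text):
    """Sistrings of a finite text, approximated by its suffixes."""
    return [text[i:] for i in range(len(text))]

def build_trie(strings):
    """Uncompressed character-level trie stored as nested dicts."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def count_nodes(trie):
    """Number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())

def occurs(trie, query):
    """A string occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(sistrings("abab"))
print(sistrings("abab"))   # ['abab', 'bab', 'ab', 'b']
print(count_nodes(trie))   # 8
print(occurs(trie, "ba"), occurs(trie, "aa"))  # True False
```

An uncompressed trie like this one can need space quadratic in the text length, because long single-child chains are stored node by node; collapsing those chains is the saving that the Patricia tree provides.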

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• General automaton searching
  – The authors present an algorithm that can search for arbitrary regular expressions in time sublinear in n on the average. They simulate a DFA in a binary trie built from all the sistrings of a text.
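The idea of simulating a DFA over the trie of sistrings can be sketched as below. This is a simplified illustration, not the authors' algorithm: the trie is character-level and uncompressed instead of a binary Patricia tree, each node stores the start positions of the sistrings passing through it (quadratic space), the DFA is written by hand as a transition table, and the start state is assumed not to be accepting. All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.starts = []    # start positions of sistrings passing through this node

def build_trie(text):
    """Insert every suffix of text, recording start positions along each path."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root

def search(root, delta, start_state, accepting):
    """Run the DFA along every trie path, stopping as early as possible.

    Returns the sorted text positions at which a match of the query begins.
    """
    found = []
    stack = [(root, start_state)]
    while stack:
        node, state = stack.pop()
        if state in accepting:
            # Every sistring below this node has a matching prefix: report and stop.
            found.extend(node.starts)
            continue
        for ch, child in node.children.items():
            nxt = delta.get((state, ch))
            if nxt is not None:  # a missing transition is a dead state: prune
                stack.append((child, nxt))
    return sorted(found)

# Hand-written DFA for the pattern ab*c: state 0 is the start, state 2 accepts.
delta = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
print(search(build_trie("abcabbcac"), delta, 0, {2}))  # [0, 3, 7]
```

Each trie edge is visited at most once, and a single visit stands for every place in the text where that prefix occurs; subtrees are abandoned as soon as the automaton either accepts or has no transition.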

Fast Text Searching for Regular Expressions or Automaton Searching on Tries

• Concluding comments
  – Using a trie or Patricia tree, we can search for many types of string searching queries in logarithmic average time, independently of the size of the answer
  – Automaton searching in a trie is sublinear in the size of the text on average for any regular expression
  – The worst case of automaton searching is linear (for unusual pieces of text)