18
Extending Finite Automata to Efficiently Match Perl- Compatible Regular Expressions Publisher : Conference on emerging Networking EXperiments and Technologies (CoNext), 2008 Author : Michela Becchi and Patrick Crowley Presenter : Yu-Hsiang Wang Date : 2011/02/16 1

Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions Publisher : Conference on emerging Networking EXperiments and Technologies

Embed Size (px)

Citation preview

  • Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions

    Publisher : Conference on emerging Networking EXperiments and Technologies (CoNext), 2008Author : Michela Becchi and Patrick CrowleyPresenter : Yu-Hsiang WangDate : 2011/02/16*

  • OutlineIntroductionCounting ConstraintsBack-ReferencesCombining Multiple Reg-ExArchitectureExperimental Evaluation

    *

  • IntroductionAs of November 2007, 5,549 of the 8,536 Snort rules contain at least one Perl- Compatible Regular Expression (PCRE). Among these, 905 (16.3%) and 2,445 (44%) contain unbounded and bounded repetitions of large character classes, respectively, and 338 (6%) include back-references.

    This paper show how the proposed extended-automaton can be combined with the hybrid-FA proposed in [20].*

  • Counting Constraints -NFAWhen the counting constraint n is large, the states of the NFA is linear in n. The basic problem of an NFA representation resides in the fact that, during operation, many states can be active in parallel, leading to a high memory bandwidth requirement and/or processing time. (e.g. aaaaaaaaaaaa...aaaabc)*

  • Counting Constraints -DFA*For large n, number of states is exponential in n.

    e.g. n=40 =>1000 billion states

  • Counting NFAThis basic concept is complicated by observing that, to preserve functional equivalence between the original and the counting-NFA, one instance of the counter is not enough. e.g. axaybzbc *

  • Counting NFADifferential representation - Since the increment operation acts in parallel on all the counter instances, the difference between them will remain constant over execution.

    store oldest (and largest) instance ci and, for j>i, cj=cj-cj-1

    * 9 3 2 2 7 - 2 2n=10

    cci8322

    8531

  • Back-ReferencesA back-reference in a regular expression refers to some sub-expression enclosed within capturing parentheses, and indicates that the referred sub-expression can be matched later within the regular expression itself.

    e.g. .*(abc|bcd).\1y -matches abcdabcdy, does not match abcdaabcy .*a([a-z]+)a\1y -matches babacabacy

    *

  • Back-References( ) Defines a marked sub-expression. The string matched within the parentheses can be recalled later. A marked sub-expression is also called a block or capturing group. \num Matches what the num-th marked sub-expression matched

    e.g. HTML ( abcde ) /^)$/ *

  • Back-ReferencesEach active state can be associated with a set of matched substrings MSk for each back-reference \k. This is performed as follows. (i) When a transition Sx Sy is taken, the set MSk associated to Sx gets moved to Sy. (ii) If the taken transition is tagged k, the current input character is appended to the strings in MSk.

    If a back-reference \k originates from state Sj, Sj is consuming: when active, all the strings in its MSk are processed and shortened (one character at a time)

    Two special conditional transitions representing the back-reference are created. If the input character matches some string in MSk then: (i) transition Sj Sj+1 is taken if the corresponding string is consumed completely; (ii) transition Sj Sj is taken otherwise.*

  • Back-ReferencesRegEx : .*(abc|bcd).\1yInput : abcdabcdy *

  • Combining Multiple Reg-ExStates representing dot-star conditions, which we will call special states.If a regular expression RE1 containing a dot-star is compiled with a regular expression RE2, the sub-DFA representing the match of RE2 is duplicated: one instance will start at state 0 and one instance will start at the special state.*

  • Combining Multiple Reg-ExWhen more regular expressions containing .* conditions are compiled, the number of possible special state combinations will affect the complexity and the size of the resulting DFA.The same considerations hold for extended automata. In fact, counting/consuming states behave like special states (they have an auto-loop on a large character class).

    *

  • Combining Multiple Reg-Ex

    When processing the input text, the head-DFA is always active: each input character will trigger a state transition on it.

    The operation of distinct tail-DFA machines is, in principle, independent. Furthermore, once the head-DFA has activated a tail-DFA, the two can execute in sequence or in parallel threads.

    [20] M. Becchi and P. Crowley, A Hybrid Finite Automaton for Practical Deep Packet Inspection, in CoNEXT 2007.*

  • Combining Multiple Reg-Ex*The tail-DFA will be activated every time the head-state 0-3 is traversed.

  • Combining Multiple Reg-ExA distinct activation of the tail-DFA is required each time head-state 1-3 is reached. Therefore, the tail-DFA may be active in parallel on different states.

    New activation of the tail-DFA always begins from the counting state 3, this ensures that, if the tail-DFA is already active, the new activation will be covered by the current active state.*

  • Architecture*Head-DFATail-DFA

  • Experimental Evaluation*

    **