Fast Submatch Extraction using OBDDs Liu Yang 1 , Pratyusa Manadhata 2 , William Horne 2 , Prasad Rao 2 , Vinod Ganapathy 1 Rutgers University 1 HP Laboratories 2

Page 1

Fast Submatch Extraction using OBDDs

Liu Yang¹, Pratyusa Manadhata², William Horne², Prasad Rao², Vinod Ganapathy¹

¹ Rutgers University
² HP Laboratories

Page 2

Applications of Regular Expressions

[Figure: a NIDS takes signatures and network traffic as input and produces alerts.]

Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.

Page 3

Applications of Regular Expressions (cont.)

[Figure labels: connectors (rule set), SIEM, web security compliance, email security compliance.]

Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.

Page 4

Submatch Extraction

Rule set: …username=(.*), hostname=(.*) …

Input: username=Bob, hostname=Foo

Submatch extraction: $1 = Bob, $2 = Foo
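The example on this slide can be reproduced with any regex engine that supports capturing groups. The sketch below uses Python's standard `re` module purely to illustrate what `$1` and `$2` mean; `re` is a backtracking matcher, not the OBDD-based method of these slides, and the variable names are illustrative.

```python
import re

# Pattern and log line are taken from the slide; variable names are illustrative.
rule = re.compile(r"username=(.*), hostname=(.*)")
event = "username=Bob, hostname=Foo"

match = rule.search(event)
# groups() returns the submatches in order: $1 is index 0, $2 is index 1.
submatches = match.groups() if match else ()
print(submatches)  # ('Bob', 'Foo')
```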

Page 5

Signature Matching

• Non-deterministic finite automata (NFAs)
– Space efficient, time inefficient

• Deterministic finite automata (DFAs)
– Time efficient, state blow-up

• Recursive backtracking
– Fast in general
– Vulnerable to algorithmic complexity attacks
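The DFA state blow-up can be made concrete with a small experiment. The sketch below is not from the slides: it assumes the textbook pattern family `(a|b)*a(a|b){k}`, whose NFA has k + 2 states, and counts the reachable DFA states produced by subset construction.

```python
def dfa_state_count(k):
    """Count reachable DFA states for (a|b)*a(a|b){k} via subset construction."""

    def step(states, symbol):
        # NFA states: 0 is the start state, k + 1 is the accept state.
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)              # self-loop on the start state
                if symbol == "a":
                    nxt.add(1)          # guess that this 'a' is the marked one
            elif s <= k:
                nxt.add(s + 1)          # consume one of the k trailing symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen, work = {start}, [start]
    while work:
        current = work.pop()
        for symbol in "ab":
            target = step(current, symbol)
            if target not in seen:
                seen.add(target)
                work.append(target)
    return len(seen)


for k in range(1, 6):
    # Columns: k, number of NFA states, number of reachable DFA states.
    print(k, k + 2, dfa_state_count(k))
```

For this family the DFA needs 2^(k+1) states because it must remember the last k + 1 symbols, while the NFA grows only linearly in k.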

Page 6

Motivation: Time/Space Tradeoff

[Figure: plot of time against space, locating the DFA (deterministic finite automaton), the NFA (non-deterministic finite automaton), backtracking, the ideal point, and our approach.]

Page 7

Our Contributions

• A novel way of annotating capturing groups: tagged-NFAs

• Design of a novel technique for submatch extraction (called Submatch-OBDD)
– Extending Thompson’s algorithm
– Using Boolean functions to represent tagged-NFAs
– Using ordered binary decision diagrams (OBDDs) to improve time efficiency

• Evaluation and comparison with RE2 and PCRE

Note: RE2 is a hybrid approach, using a mix of DFA/NFA, while PCRE uses recursive backtracking.

Page 8

Solution Overview

RegExps with capturing groups → Tagged-NFAs → Boolean representations → OBDD representations

Page 9

NFA Representation of RegExps

E = a*aa

[Figure: NFA of regexp “a*aa”]

Transition table T(x,i,y):

Current state (x)   Input symbol (i)   Next state (y)
1                   a                  1
1                   a                  2
2                   a                  3

Page 10

Submatch Tagging: tagged NFAs

E = (a*)aa

Tag(E) = (a*)_{t1} aa

[Figure: tagged NFA of “(a*)aa” with submatch tag t1; the transition from state 1 to itself is labeled a / t1.]

Extended transition table T(x,i,y,t) of the tagged NFA:

Current state (x)   Input symbol (i)   Next state (y)   Output tags (t)
1                   a                  1                {t1}
1                   a                  2                {}
2                   a                  3                {}

Page 11

Match Test

RegExp = (a*)aa; Input: aaaa

[Figure: states 1, 2, 3 unrolled over the input; each transition that stays in state 1 outputs {t1}, and state 3 is the accept state.]

Input symbols: a, a, a, a

Frontier: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3} (accept)
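The frontier sequence for this example can be reproduced with a few lines of explicit set manipulation. This is a minimal sketch of frontier-based NFA simulation using the tagged transition table for (a*)aa from the slides; the function names `frontiers` and `matches` are illustrative, and no OBDDs are involved yet.

```python
# Tagged transition table T(x, i, y, t) for (a*)aa, copied from the slides.
T = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}


def frontiers(text):
    """Return the frontier before the first symbol and after each symbol."""
    frontier = set(START)
    history = [frontier]
    for symbol in text:
        # Follow every transition that leaves the frontier on this symbol.
        frontier = {y for (x, i, y, _tags) in T if x in frontier and i == symbol}
        history.append(frontier)
    return history


def matches(text):
    """The input is accepted if the final frontier contains an accept state."""
    return bool(frontiers(text)[-1] & ACCEPT)


print([sorted(f) for f in frontiers("aaaa")])
# [[1], [1, 2], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(matches("aaaa"), matches("a"))  # True False
```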

Page 12

Submatch Extraction

[Figure: tagged NFA for (a*)aa unrolled over the input aaaa, with {t1} on each transition that stays in state 1 and state 3 as the accept state.]

Frontier: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3} (accept)

Any path from an accept state to a start state generates a valid assignment of submatches.

$1=aa
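The backward walk from an accept state to a start state might be sketched as follows, using explicit sets rather than OBDDs. The helper names (`extract`, `last_run`) and the tie-break that picks the smallest predecessor state are assumptions made for illustration; the slides only require that some path be chosen.

```python
# Tagged transition table T(x, i, y, t) for (a*)aa, copied from the slides.
T = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT, TAGS = {1}, {3}, ["t1"]


def last_run(text, tags_at, tag):
    """Return the last consecutive run of characters whose transition carried tag."""
    end = len(text)
    while end > 0 and tag not in tags_at[end - 1]:
        end -= 1
    begin = end
    while begin > 0 and tag in tags_at[begin - 1]:
        begin -= 1
    return text[begin:end]


def extract(text):
    """Return {tag: submatch} along one accepting path, or None without a match."""
    # Forward pass: history[k] is the frontier after reading k symbols.
    history = [set(START)]
    for symbol in text:
        history.append({y for (x, i, y, _t) in T
                        if x in history[-1] and i == symbol})
    reached = history[-1] & ACCEPT
    if not reached:
        return None

    # Backward pass: follow one path from an accept state to a start state.
    state = min(reached)  # assumed tie-break; any reached accept state would do
    tags_at = [frozenset()] * len(text)
    for pos in range(len(text) - 1, -1, -1):
        preds = [(x, t) for (x, i, y, t) in T
                 if y == state and i == text[pos] and x in history[pos]]
        state, tags_at[pos] = min(preds, key=lambda p: p[0])
    return {tag: last_run(text, tags_at, tag) for tag in TAGS}


print(extract("aaaa"))  # {'t1': 'aa'}
```

Restricting predecessors to `history[pos]` guarantees that the walk ends in a start state, since every state in a frontier was reached from the previous frontier.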

Page 13

Complexity of Tagged NFAs

Match test: $O(nl)$

Submatch extraction: $O(nl)$

n – size of the tagged NFA; l – length of the input string

Can we make the operations faster?

Page 14

Submatch-OBDD

• Representing tagged NFAs using Boolean functions
– Updating frontiers in one step using a single Boolean formula

• Using OBDDs to manipulate Boolean functions
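To give a feel for what manipulating Boolean functions with OBDDs involves, here is a teaching-sized reduced ordered BDD in Python. It is only a sketch under simplifying assumptions (variables ordered by integer index, no garbage collection, no complement edges) and is not the CUDD-based implementation used in this work; the class and method names are illustrative.

```python
import operator

AND, OR = operator.and_, operator.or_


class BDD:
    """Tiny reduced ordered BDD manager; variables are ordered by index."""

    def __init__(self):
        self.nodes = {0: None, 1: None}  # node id -> (var, low, high); 0 and 1 are terminals
        self.unique = {}                 # (var, low, high) -> node id
        self.cache = {}                  # memo table for apply

    def mk(self, var, low, high):
        if low == high:                  # redundant test: skip the node
            return low
        key = (var, low, high)
        if key not in self.unique:       # share structurally identical nodes
            node = len(self.nodes)
            self.nodes[node] = key
            self.unique[key] = node
        return self.unique[key]

    def var(self, index):
        return self.mk(index, 0, 1)

    def apply(self, op, u, v):
        """Combine two BDDs with a binary Boolean operator."""
        if u <= 1 and v <= 1:
            return int(op(bool(u), bool(v)))
        key = (op, u, v)
        if key in self.cache:
            return self.cache[key]
        uvar = self.nodes[u][0] if u > 1 else float("inf")
        vvar = self.nodes[v][0] if v > 1 else float("inf")
        top = min(uvar, vvar)            # split on the earliest variable
        u0, u1 = self.nodes[u][1:] if uvar == top else (u, u)
        v0, v1 = self.nodes[v][1:] if vvar == top else (v, v)
        result = self.mk(top, self.apply(op, u0, v0), self.apply(op, u1, v1))
        self.cache[key] = result
        return result

    def restrict(self, u, var, value):
        """Fix variable var to value (0 or 1)."""
        if u <= 1:
            return u
        v, low, high = self.nodes[u]
        if v > var:                      # var cannot occur below this node
            return u
        if v == var:
            return high if value else low
        return self.mk(v, self.restrict(low, var, value),
                       self.restrict(high, var, value))

    def exists(self, var, u):
        """Existential quantification: f[var=0] OR f[var=1]."""
        return self.apply(OR, self.restrict(u, var, 0), self.restrict(u, var, 1))


bdd = BDD()
x0, x1, x2 = bdd.var(0), bdd.var(1), bdd.var(2)
f = bdd.apply(OR, bdd.apply(AND, x0, x1), x2)
g = bdd.apply(OR, x2, bdd.apply(AND, x1, x0))
print(f == g)  # True: equivalent functions share one canonical node
```

Conjunction, disjunction, and existential quantification are the operations that the frontier update relies on; canonicity is what makes equality checks a comparison of node ids.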

Page 15

Transitions as Boolean Functions

Current state (x)   Input symbol (i)   Next state (y)   Output tag (t)
1                   a                  1                {t1}
1                   a                  2                {}
2                   a                  3                {}

T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})

RegExp: (a*)aa
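One way to read T(x,i,y,t) as a genuine Boolean function is to encode states, symbols, and tags as bit vectors, so that each table row becomes one conjunction of literals. The bit encodings below (two bits per state, one bit for the symbol, one bit for t1) are assumptions chosen for illustration; the slides do not specify an encoding.

```python
from itertools import product

# Assumed encodings, not taken from the slides.
STATE_BITS = {1: (0, 1), 2: (1, 0), 3: (1, 1)}
SYMBOL_BITS = {"a": (1,)}
# Rows of the table as (x, i, y, "is t1 output?").
ROWS = [(1, "a", 1, True), (1, "a", 2, False), (2, "a", 3, False)]


def minterm(x, i, y, t1):
    """Bit vector (x1, x0, i0, y1, y0, t1) for one row of the table."""
    return STATE_BITS[x] + SYMBOL_BITS[i] + STATE_BITS[y] + (int(t1),)


MINTERMS = [minterm(*row) for row in ROWS]


def T(*bits):
    """Disjunction over rows of the conjunction of that row's literals."""
    return any(all(b == m for b, m in zip(bits, term)) for term in MINTERMS)


satisfying = [bits for bits in product((0, 1), repeat=6) if T(*bits)]
print(satisfying)
# [(0, 1, 1, 0, 1, 1), (0, 1, 1, 1, 0, 0), (1, 0, 1, 1, 1, 0)]
```

The function is true on exactly three of the 64 possible assignments, one per transition, which is the property an OBDD for T would encode compactly.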

Page 16

Match Test using Boolean Functions

Input: aaaa

Each step conjoins the current states, the input symbol, and the transition table T(x,i,y,t). The result is the set of intermediate transitions; its next states become the current states of the following step.

First symbol (start states {1}):
{1} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{})

Second symbol:
{1,2} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})

Third symbol:
{1,2,3} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})

Accept

Page 17

Submatch Extraction using Boolean Functions

Input: aaaa. Start from the last symbol, going backwards.

Fourth symbol (the last input symbol a, accept state 3), using intermediate transitions [4]:
a ∧ 3 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 2∧a∧3∧{}
Previous state of 3: 2. No output submatch tag.

Rename previous state as current state and continue.

Third symbol, using intermediate transitions [3]:
a ∧ 2 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 1∧a∧2∧{}
Previous state of 2: 1. No output submatch tag.

Page 18

Submatch Extraction using Boolean Functions

Second symbol, using intermediate transitions [2]:
a ∧ 1 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 1∧a∧1∧t1
Previous state of 1: 1. Output submatch tag t1.

First symbol, using intermediate transitions [1]:
a ∧ 1 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{})) = 1∧a∧1∧t1
Previous state of 1: 1. Output submatch tag t1.

Tags over the input aaaa: t1, t1 on the first two symbols.

$1=aa

Page 19

More Formal: Match Test

Finding new frontiers after processing an input symbol:

Next frontiers = $\mathit{Map}_{y \to x}\bigl(\exists\, x,i,t.\ \mathit{InputSymbol}(i) \wedge \mathit{Frontier}(x) \wedge \mathit{TransFunction}(x,i,y,t)\bigr)$

Checking acceptance:

$\mathit{SAT}\bigl(\mathit{AcceptStates}(x) \wedge \mathit{Frontier}(x)\bigr)$
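The frontier update uses three symbolic operations: conjunction, existential quantification, and variable renaming (Map). The sketch below mimics them on explicit sets of satisfying assignments, which stand in for OBDDs here; the `Rel` class and its method names are hypothetical, and an OBDD package would supply equivalent operations on diagrams instead of sets.

```python
class Rel:
    """A Boolean function given by its satisfying assignments over named variables."""

    def __init__(self, names, rows):
        self.names = tuple(names)
        self.rows = {tuple(r) for r in rows}

    def conjoin(self, other):
        """Conjunction: keep pairs of assignments that agree on shared variables."""
        shared = [n for n in self.names if n in other.names]
        extra = [n for n in other.names if n not in self.names]
        rows = set()
        for a in self.rows:
            da = dict(zip(self.names, a))
            for b in other.rows:
                db = dict(zip(other.names, b))
                if all(da[n] == db[n] for n in shared):
                    rows.add(a + tuple(db[n] for n in extra))
        return Rel(self.names + tuple(extra), rows)

    def exists(self, *drop):
        """Existential quantification: project the named variables away."""
        keep = [k for k, n in enumerate(self.names) if n not in drop]
        return Rel([self.names[k] for k in keep],
                   {tuple(r[k] for k in keep) for r in self.rows})

    def rename(self, old, new):
        """Map: rename one variable."""
        return Rel([new if n == old else n for n in self.names], self.rows)

    def sat(self):
        return bool(self.rows)


# TransFunction(x, i, y, t) for (a*)aa, copied from the slides.
TRANS = Rel(("x", "i", "y", "t"), [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
])
START = Rel(("x",), [(1,)])
ACCEPT = Rel(("x",), [(3,)])


def next_frontier(frontier, symbol):
    input_symbol = Rel(("i",), [(symbol,)])
    image = input_symbol.conjoin(frontier).conjoin(TRANS).exists("x", "i", "t")
    return image.rename("y", "x")


def run(text):
    frontier = START
    for symbol in text:
        frontier = next_frontier(frontier, symbol)
    return frontier, ACCEPT.conjoin(frontier).sat()


frontier, accepted = run("aaaa")
print(sorted(frontier.rows), accepted)  # [(1,), (2,), (3,)] True
```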

Page 20

More Formal: Submatch Extraction

A back traversal approach: starting from the last input symbol.

$\mathit{OneReverseTransition} = \mathit{PickOne}\bigl(\mathit{CurrentState}(y) \wedge \mathit{InputSymbol}(i) \wedge \mathit{IntermTransitions}(x,i,y,t)\bigr)$

$\mathit{PreviousState} = \mathit{Map}_{x \to y}\bigl(\exists\, i,y,t.\ \mathit{OneReverseTransition}\bigr)$

$\mathit{OutputTag} = \exists\, x,i,y.\ \mathit{OneReverseTransition}$

Submatch extraction: the last consecutive sequence of characters that are assigned with $t_i$

Page 21

Submatch-OBDD

• Representation of tagged NFAs, match test, and submatch extraction using OBDDs

• OBDD representations for
– Transitions with submatch tags
– Intermediate transitions
– Submatch tags
– Set of start states
– Set of accept states
– Set of frontiers
– Input symbols

Page 22

Implementation

RegExps → RE2TNFA → Tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH

PATTERNMATCH input: input strings / network traffic

PATTERNMATCH output: either "Matched at reg#" with submatches $1 = …, $2 = …, or "No match"

Toolchain in C++, interfacing with CUDD*

*CUDD is a package for the manipulation of binary decision diagrams.

Page 23

Feasibility Study

• Data sets
– Snort-2009
  • RegExps: 115 regexps with capturing groups from HTTP rules
  • Traces
    – 1.2GB department network traffic (average packet size 126 bytes)
    – 1.3GB Twitter traffic (average packet size 1202 bytes)
    – 1MB synthetic trace (average string length 311 bytes)
– Snort-2012
  • RegExps: 403 regexps with capturing groups from HTTP rules
  • Traces
    – 1.2GB department network traffic (average packet size 126 bytes)
    – 1.3GB Twitter traffic (average packet size 1202 bytes)
    – 1MB synthetic trace (average string length 689 bytes)
– Firewall-504
  • RegExps: 504 patterns from a commercial firewall F
  • Trace: 87MB of firewall logs (average line size 87 bytes)

Page 24

Experimental Setup

• Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM

• Two configurations for pattern matching
– Conf. S
  • Patterns compiled individually
  • Compiled patterns matched sequentially against input traces
– Conf. C
  • Patterns combined with UNION and compiled
  • Combined pattern matched against input traces

Page 25

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set

Page 26

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set

Page 27

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set

Page 28

Related Work

• NFA-OBDD [Yang et al., RAID’10, Chasaki and Wolf, ANCS’10]
• RE2 [Cox, code.google.com/p/re2]
• PCRE [www.pcre.org]
• TNFA [Laurikari et al., SPIRE’00]
• MDFA [Yu et al., ANCS’06]
• Hybrid FA [Becchi and Crowley, CoNEXT’07]
• XFA [Smith et al., Oakland’08]
• More: see paper for details

Page 29

Conclusion

• A novel way of annotating capturing groups

• Submatch-OBDD: a novel technique for submatch extraction using OBDDs

• Feasibility study
– Submatch-OBDD achieves ideal performance when patterns are combined
– Faster than RE2 and PCRE when patterns are combined