Algorithms to Accelerate Multiple Regular Expressions
Matching for Deep Packet Inspection
Sailesh Kumar, Sarang Dharmapurikar,
Fang Yu, Patrick Crowley, Jonathan Turner
Presented by: Sailesh Kumar
2 - Sailesh Kumar - 04/21/23
Overview
Why regular expression acceleration is important
Introduction to our approach
» Delayed Input DFA (D2FA)
D2FA construction
Simulation results
Memory mapping algorithm
Conclusion
Why Regular Expressions Acceleration?
RegEx are now widely used
» Network intrusion detection systems (NIDS)
» Layer 7 switches, load balancing
» Firewalls, filtering, authentication and monitoring
» Content-based traffic management and routing
RegEx matching is expensive
» Space: large amount of memory
» Bandwidth: requires 1+ state traversal per byte
RegEx is a performance bottleneck
» In enterprise switches from Cisco, etc.
» Cisco security appliances
  – Use DFA, 1+ GB memory, still sub-gigabit throughput
» Need to accelerate RegEx!
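The bandwidth cost above comes from the fact that a table-based DFA performs one transition lookup per input byte. A minimal illustrative sketch (a hypothetical 4-state DFA for payloads containing the substring "abc", not any of the rule sets discussed in this talk):

```python
# Minimal sketch (not the paper's implementation): matching with a
# table-based DFA costs one transition lookup per input byte, so the
# required memory bandwidth scales directly with the line rate.

def build_dfa_for_abc():
    """Hypothetical 4-state DFA accepting strings containing 'abc'."""
    # next_state[state][byte] -> state; state 3 is accepting.
    next_state = [[0] * 256 for _ in range(4)]
    for s in range(3):
        next_state[s][ord('a')] = 1          # restart partial match on 'a'
    next_state[1][ord('b')] = 2
    next_state[2][ord('c')] = 3
    next_state[3] = [3] * 256                # accepting state absorbs
    return next_state

def scan(next_state, payload: bytes) -> bool:
    state = 0
    for byte in payload:                     # one table lookup per byte
        state = next_state[state][byte]
    return state == 3

dfa = build_dfa_for_abc()
print(scan(dfa, b"xxabcxx"))   # True
print(scan(dfa, b"xxabxcx"))   # False
```

With a naive 256-entry row per state, each of those lookups touches a table of `states x 256` words, which is why both the space and the bandwidth of the DFA representation matter.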
Can we do better?
Well studied in compiler literature
» What's different in networking?
» Can we do better?
Construction time versus execution time (grep)
» Traditionally, (construction + execution) time is the metric
» In the networking context, execution time is critical
» Also, there may be thousands of patterns
DFAs are fast
» But can have an exponentially large number of states
» Algorithms exist to minimize the number of states
» Still 1) low performance and 2) gigabytes of memory
How to achieve high performance?
» Use ASIC/FPGA
  – On-chip memories provide ample bandwidth
  – Volume and the need for speed justify a custom solution
» Limited memory, need a space-efficient representation!
Introduction to Our Approach
How to represent DFAs more compactly?
» Can't reduce the number of states
» How about reducing the number of transitions?
  – 256 transitions per state
  – 50+ distinct transitions per state (real-world datasets)
  – Need at least 50+ words per state
Three rules: a+, b+c, c*d+
[Figure: the DFA for these rules (states 1-5); 4 distinct transitions per state on a, b, c, d.]
Look at state pairs: there are many common transitions. How can we remove them?
Introduction to Our Approach
[Figure: the DFA for the three rules, alongside an alternative representation in which common transitions are replaced by default transitions.]
Fewer transitions, less memory
D2FA Operation
[Figure: the DFA (left) and the D2FA (right) for the three rules; in the D2FA, each state keeps a few labeled transitions plus one heavy, unlabeled default transition.]
Input stream: a b d — the DFA and the D2FA visit the same accepting state after consuming each character.
Heavy edges are called default transitions. Take a default transition whenever a labeled transition is missing.
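The lookup rule above can be sketched as follows. This is a hedged, minimal sketch: the 3-state machine is hypothetical (it is not the deck's 5-state example), and a real implementation would index packed memory words rather than Python dicts.

```python
# D2FA traversal sketch: each state stores only a partial transition
# table plus one unlabeled default transition; on a miss we follow
# default transitions (possibly several) until a labeled match exists.

# Hypothetical 3-state machine over {a, b, c}.
labeled = [
    {'a': 1, 'b': 0, 'c': 0},   # state 0: the tree root, fully specified
    {'a': 1, 'b': 2},           # state 1: shares its 'c' target with state 0
    {},                          # state 2: shares everything with state 0
]
default = [None, 0, 0]           # default transitions form a tree rooted at 0

def step(state: int, sym: str) -> int:
    hops = 0
    while sym not in labeled[state]:
        state = default[state]   # no input is consumed on a default edge
        hops += 1
        assert hops <= len(labeled), "default-transition cycle"
    return labeled[state][sym]

def run(s: str) -> int:
    state = 0
    for ch in s:                 # exactly one character consumed per step()
        state = step(state, ch)
    return state

print(run("ab"))   # 2
print(run("abc"))  # 0  (state 2 falls back to the root to resolve 'c')
```

The inner `while` loop is exactly the extra work a D2FA trades for space: bounding the default path length bounds the number of memory accesses per character.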
D2FA Operation
[Figure: the DFA and the D2FA from the previous slide, repeated.]
Any set of default transitions will suffice if there are no cycles of default transitions.
Thus, we need to construct trees of default transitions.
So, how do we construct space-efficient D2FAs while keeping default paths bounded?
[Figure: two alternative default-transition trees over states 1-5.]
The above two sets of default-transition trees are also correct; however, we may traverse 2 default transitions to consume a character.
Thus, we need to do more work => lower performance.
D2FA Construction
We present a systematic approach to construct D2FAs:
» Begin with a state-minimized DFA
» Construct a space reduction graph
  – Undirected graph; vertices are the states of the DFA
  – Edges exist between vertices with common transitions
  – Weight of an edge = # of common transitions - 1
[Figure: the example DFA (left) and its space reduction graph (right); the edges between states 1-5 carry weights of 2 or 3.]
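The space reduction graph described above can be sketched directly from its definition. A hedged sketch, using a made-up 3-state DFA (dicts of symbol to next state) rather than the slide's example; edges whose weight would be zero are dropped, since a default transition there saves nothing:

```python
# Build the space reduction graph: an undirected graph over DFA states
# where an edge's weight is (# of common transitions - 1), i.e. the net
# saving from replacing those transitions with one default edge.
from itertools import combinations

def space_reduction_graph(dfa):
    """dfa: list of dicts mapping symbol -> next state."""
    edges = {}
    for u, v in combinations(range(len(dfa)), 2):
        common = sum(1 for sym, t in dfa[u].items() if dfa[v].get(sym) == t)
        if common >= 2:                      # keep only weight-1+ edges
            edges[(u, v)] = common - 1
    return edges

# Hypothetical 3-state DFA over {a, b, c}.
dfa = [
    {'a': 1, 'b': 0, 'c': 2},
    {'a': 1, 'b': 0, 'c': 2},
    {'a': 1, 'b': 2, 'c': 2},
]
print(space_reduction_graph(dfa))  # {(0, 1): 2, (0, 2): 1, (1, 2): 1}
```

States 0 and 1 agree on all three symbols, so a default edge between them saves two stored transitions, matching the weight-2 edge.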
D2FA Construction
Convert certain edges into default transitions
» A default transition removes w transitions (w = weight of the edge)
» Picking high-weight edges => more space reduction
» Find a maximum-weight spanning forest
» Tree edges become the default transitions
Problem: the spanning tree may have a very large diameter
» Longer default paths => lower performance
[Figure: a maximum-weight spanning tree over the space reduction graph, rooted at one state; # of transitions removed = 2+3+3+3 = 11.]
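The maximum-weight spanning forest step is standard Kruskal with the comparison reversed. A sketch with union-find; the edge weights below are illustrative, not the slide's example:

```python
# Maximum-weight spanning forest via Kruskal's algorithm; each chosen
# edge becomes a default transition and saves `weight` stored transitions.

def max_weight_spanning_forest(n, edges):
    """edges: dict {(u, v): weight}; returns (chosen edges, total savings)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    chosen, saved = [], 0
    # Kruskal: consider edges in decreasing weight order.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        ru, rv = find(u), find(v)
        if ru != rv:                         # no cycle among default edges
            parent[ru] = rv
            chosen.append((u, v))
            saved += w
    return chosen, saved

edges = {(0, 1): 3, (1, 2): 3, (2, 3): 2, (0, 3): 1}
forest, saved = max_weight_spanning_forest(4, edges)
print(saved)  # 8: edges (0,1), (1,2), (2,3) are kept, (0,3) would cycle
```

The cycle check is what guarantees the default transitions form trees, which the earlier slide showed is required for correctness.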
D2FA Construction
We need to construct bounded-diameter trees
» NP-hard
» A small diameter bound leads to low tree weight
  – Less space-efficient D2FA
» Time-space trade-off
We propose a heuristic algorithm based upon Kruskal's algorithm to create compact bounded-diameter D2FAs
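The refined heuristic itself is only summarized in this deck (details are in the paper). As a hedged stand-in, the sketch below runs Kruskal's algorithm but rejects any edge whose addition would push the merged tree's diameter past the bound; it checks this with a BFS per candidate edge, which is simpler and slower than what the paper does:

```python
# Stand-in for a bounded-diameter Kruskal variant (NOT the paper's exact
# heuristic): greedily add max-weight edges, skipping any edge that would
# create a cycle or exceed the diameter bound on default paths.
from collections import deque

def distances(adj, src):
    """BFS distances from src within its current tree."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def bounded_forest(n, edges, bound):
    adj = {i: set() for i in range(n)}
    chosen = []
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        du = distances(adj, u)               # u's tree
        if v in du:                          # same tree: would form a cycle
            continue
        dv = distances(adj, v)               # v's tree
        # Longest default path through the new edge = ecc(u) + 1 + ecc(v);
        # both trees already respect the bound, so this is the only check.
        if max(du.values()) + 1 + max(dv.values()) > bound:
            continue
        adj[u].add(v)
        adj[v].add(u)
        chosen.append((u, v))
    return chosen

edges = {(0, 1): 3, (1, 2): 3, (2, 3): 3, (3, 4): 3}
print(bounded_forest(5, edges, 2))  # [(0, 1), (1, 2), (3, 4)]
```

With a bound of 2 the 5-state chain is split into two trees, trading away the weight of edge (2, 3) for shorter default paths, which is exactly the time-space trade-off named above.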
D2FA Construction
Our heuristic incrementally builds the spanning tree
» Whenever there is an opportunity, keep the diameter small
» Based upon Kruskal's algorithm
» Details in the paper
Results
We ran experiments on
» Cisco RegEx rules
» Linux application protocol classifier rules
» Bro rules
» Snort rules (subset of rules)
Size of DFA versus D2FA (no default path length bound applied)

          Original DFA             D2FA (normal spanning tree)       D2FA (refined spanning tree)
DFA       # states  # transitions  # trans.  % red.  Max. def. len.  # trans.  % red.  Max. def. len.
Cisco590  17713     4.5M           36k       99.2    57              36k       99.2    17
Cisco103  21050     5.3M           53k       99.0    54              53k       99.0    19
Cisco7    4260      1.0M           28k       97.4    61              28k       97.4    23
Linux56   13953     3.5M           58k       98.3    30              58k       98.3    21
Linux10   13003     3.3M           285k      91.3    20              285k      91.3    17
Snort11   41949     10.7M          168k      98.4    9               168k      98.4    6
Bro648    6216      1.5M           7k        99.5    17              7k        99.5    8
Space-Time Tradeoff
[Figure: normalized D2FA size (log scale, 0.001 to 1) versus the bound on the default path length (1 to 7), for Cisco103, Linux56, Snort11 and Bro648.]
Longer default paths => more work but less space.
In the space-efficient region, default paths have length 4+, which requires 4+ memory accesses per character.
We propose a memory architecture which enables us to consume one character per clock cycle.
Summary of Memory Architecture
We propose an on-chip ASIC architecture
» Use multiple embedded memories to store the D2FA
  – Flexibility
  – Frequent changes to rules
D2FA requires multiple memory accesses
» How to execute the D2FA at memory clock rates?
We have proposed a deterministic, contention-free memory mapping algorithm
» Uniform access to memories
» Enables the D2FA to consume a character per memory access
» Nearly zero memory fragmentation
  – All memories are uniformly used
Details and results in the paper
At 300 MHz we achieve 5 Gbps worst-case throughput
Conclusion
Deep packet inspection has become challenging
» RegEx are used to specify rules
» Wire-speed inspection
We presented an ASIC-based architecture to perform RegEx matching at 10's of gigabit rates
As suggested in the public review, this paper is not the final answer to RegEx matching
» But it is a good start
We are presently developing techniques to perform fast RegEx matching using commodity memories
» Collaborators are welcome!
Thank you and Questions?
Backup Slides
D2FA Construction
Our heuristic incrementally builds the spanning tree
» Whenever there is an opportunity, keep the diameter small
» Details in the paper
Graph with 31 states, maximum-weight default transition tree
» Our heuristic creates shorter default paths
[Figure: two default-transition trees over a 31-state graph. Left: Kruskal's algorithm, max. default path = 8 edges. Right: our refined Kruskal's algorithm, avg. default path = 5 edges.]
Multiple Memories
To achieve high performance, use multiple memories and D2FA engines
Multiple memories provide high aggregate bandwidth
Multiple engines use the bandwidth effectively
» However, worst-case performance may be low
  – No better than a single memory
» May need complex circuitry to handle contention
We propose a deterministic, contention-free memory mapping and compare it to a random mapping
[Figure: several D2FA scanners connected to several memories.]
Memory Mapping
The memory mapping algorithm can be modeled as graph coloring
» The graph is the set of default transition trees
» Colors represent the memory modules
» Color the nodes of the trees such that
  – Nodes along a default path are colored with different colors
  – All colors are uniformly used
We propose two methods, naïve and adaptive
[Figure: the same default-transition trees colored with 4 colors, under naïve coloring (left) and adaptive coloring (right).]
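The two coloring styles can be sketched as follows, assuming k memories and default paths of at most k-1 edges. The tree shape and the exact tie-breaking below are assumptions for illustration, not the paper's algorithm: the naïve rule colors by depth mod k, while the adaptive rule additionally picks the globally least-used legal color to balance memory occupancy.

```python
# Sketch of contention-free tree coloring: a node's color (= memory
# module) must differ from its k-1 nearest ancestors, so every default
# path of length < k touches k distinct memories.
from collections import Counter

def color_tree(children, root, k, adaptive=True):
    usage = Counter(range(k))    # seed all colors equally for tie-breaking
    color = {}

    def visit(node, ancestors):
        banned = set(ancestors[-(k - 1):]) if k > 1 else set()
        legal = [c for c in range(k) if c not in banned]
        if adaptive:
            pick = min(legal, key=lambda c: usage[c])   # balance usage
        else:
            pick = len(ancestors) % k                   # naïve: depth mod k
        color[node] = pick
        usage[pick] += 1
        for ch in children.get(node, ()):
            visit(ch, ancestors + [pick])

    visit(root, [])
    return color

# Hypothetical default-transition tree: root 0 with two branches below it.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
c = color_tree(children, 0, k=4)
# Adjacent nodes on every default path get distinct colors:
print(c[0] != c[1] != c[3])  # True
```

Balancing the per-color usage is what keeps fragmentation low: every memory module ends up holding roughly the same number of states.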
Results
Adaptive mapping leads to much more uniform color usage
» Memories are uniformly used, little fragmentation
» Up to 20% space saving with adaptive coloring
Throughput results (300 MHz dual-port eSRAM)
[Figure: throughput (Gbps, 0 to 10) versus the number of concurrently scanned packets (1 to 64), for adaptive and randomized mapping, measured on average-performance and on synthetically generated worst-case input data.]