Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms
Sailesh Kumar
Advisors: Jon Turner, Patrick Crowley
Committee: Roger Chamberlain, John Lockwood, Bob Morley
2 - Sailesh Kumar - 04/22/23
Focus on 3 Network Features
In this proposal, we focus on 3 network features:
» Packet payload inspection (network security)
» Packet header processing (packet forwarding, classification, etc.)
» Packet buffering and queuing (QoS)
Overview of the Presentation
Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results
IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA
Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing
Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist
Delayed Input DFA (D2FA), SIGCOMM’06
Many transitions in a DFA
» 256 transitions per state
» 50+ distinct transitions per state (real-world datasets)
» Need 50+ words per state
Can we reduce the number of transitions in a DFA?
[Figure: DFA for the three rules a+, b+c, c*d+, with states 1-5 and 4 transitions per state]
Look at state pairs: there are many common transitions. How can we remove them?
Alternative representation
[Figure: the same DFA redrawn; common transitions shared between state pairs are factored out, giving fewer transitions and less memory]
D2FA Operation
[Figure: the DFA and the corresponding D2FA side by side]
Heavy edges are called default transitions. Take a default transition whenever a labeled transition is missing.
D2FA versus DFA
D2FAs are compact but require multiple memory accesses
» Up to 20x increased memory accesses
» Not desirable in off-chip architectures
Can D2FAs match the performance of DFAs?
» YES!!!!
» Content Addressed D2FAs (CD2FA)
CD2FAs require only one memory access per byte
» Match the performance of a DFA in a cacheless system
» In systems with a data cache, CD2FAs are 2-3x faster
CD2FAs are 10x more compact than DFAs
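The default-transition mechanics can be sketched in a few lines of Python. The automaton below is a hypothetical toy, not the one from the figures, and falling back to a start state when no default edge exists is a simplification:

```python
# Toy D2FA traversal: each state stores only a few labeled transitions
# plus at most one default transition; a missing label is resolved by
# chasing default edges (each hop is an extra memory access in hardware).
labeled = {                       # state -> {char: next_state} (hypothetical)
    1: {"a": 1},
    2: {"b": 2, "c": 3},
    3: {},
    4: {"d": 5},
    5: {"d": 5},
}
default = {2: 1, 3: 4, 5: 4, 4: 1}   # default edges; state 1 has none

def step(state, ch):
    # Follow default transitions until a labeled transition for ch exists.
    while ch not in labeled[state]:
        if state not in default:
            return 1              # fall back to the start state (assumption)
        state = default[state]    # one extra memory access per default edge
    return labeled[state][ch]
```

Each default edge followed costs an extra memory access, which is exactly the overhead CD2FA is designed to remove.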
Introduction to CD2FA, ANCS’06
How to avoid the multiple memory accesses of D2FAs?
» Avoid the lookup to decide if the default path needs to be taken
» Avoid default path traversal
Solution: assign a content label to each state, containing:
» Characters for which it has labeled transitions
» Information about all of its default states
» Characters for which its default states have labeled transitions
[Figure: a default chain V → U → R; state R is found at location R, state U (label "cd,R") at hash(c,d,R), and state V (label "ab,cd,R") at hash(a,b,hash(c,d,R))]
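The addressing trick can be seen in a toy calculation. Here md5 and the concrete root address are stand-ins (assumptions), not the paper's hash or layout; the point is that a state's address is derived from its content label:

```python
import hashlib

# Content-addressed states: a state lives at an address computed by
# hashing its labeled characters together with its default state's address.
def h(chars, default_addr):
    s = f"{chars}|{default_addr}".encode()
    return int(hashlib.md5(s).hexdigest(), 16) % (1 << 20)

R_addr = 12345                # address of root state R (assumed known)
U_addr = h("cd", R_addr)      # state U, content label "cd,R"
V_addr = h("ab", U_addr)      # state V, label "ab,cd,R" = hash(a,b,hash(c,d,R))

# V's content label alone ("ab,cd,R") is enough to recompute U's address,
# so following V's default transition needs no extra memory access.
recomputed_U = h("cd", R_addr)
```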
Introduction to CD2FA
[Figure: current state V (label = ab,cd,R); on an input character, the next state's address is computed directly from the content label, e.g. the transition to state X (label = pq,lm,Z) is found at hash(p,q,hash(l,m,Z))]
Construction of CD2FA
We seek to keep the content labels small.
Twin objectives:
» Ensure that states have few labeled transitions
» Ensure that default paths are as short as possible
Proposed a new heuristic called CRO to construct CD2FAs
» Details in the ANCS’06 paper
» With a default path bound of 2 edges, the CRO algorithm constructs up to 10x more space-efficient CD2FAs
Memory Mapping in CD2FA
[Figure: the earlier example mapped into memory; the addresses hash(a,b,hash(c,d,R)), hash(c,d,R) and hash(p,q,hash(l,m,Z)) collide at the same memory location]
So far we have assumed that hashing is collision-free.
Collision-free Memory Mapping
[Figure: four states with content labels abc, def, pqr, lmn and four memory locations; edges are added for all possible choices of each label, e.g. hash(abc, …), hash(def, …), hash(edf, …), hash(lmn, …), hash(mln, …)]
Add edges for all possible choices: four states, 4 memory locations.
Bipartite Graph Matching
Bipartite graph:
» Left nodes are state content labels
» Right nodes are memory locations
» An edge for every choice of content label
» Map state labels to unique memory locations
» Perfect matching problem
With n left and n right nodes
» Need O(log n) random edges
» n = 1M implies we need ~20 edges per node
If we provide slight memory over-provisioning
» We can uniquely map state labels with far fewer edges
In our experiments, we found perfect matchings without memory over-provisioning.
[Figure: content labels matched to memory addresses]
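The matching step itself can be sketched with the textbook augmenting-path algorithm for bipartite matching; the 4-state example and its candidate locations below are hypothetical:

```python
# Map each content label to one of its candidate memory locations so
# that no two labels share a location (augmenting-path matching).
def perfect_matching(choices):
    """choices: {label: [candidate locations]} -> {label: location} or None"""
    owner = {}                       # location -> label currently holding it

    def assign(label, visited):
        for loc in choices[label]:
            if loc in visited:
                continue
            visited.add(loc)
            # Take a free location, or evict the holder if it can move.
            if loc not in owner or assign(owner[loc], visited):
                owner[loc] = label
                return True
        return False

    for label in choices:
        if not assign(label, set()):
            return None              # no perfect matching with these edges
    return {label: loc for loc, label in owner.items()}

# Hypothetical 4-state example in the spirit of the figure:
m = perfect_matching({
    "abc": [0, 2], "def": [0, 1], "pqr": [1, 3], "lmn": [0, 3],
})
```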
Reg-ex – New Directions
Three key problems with traditional DFA-based reg-ex matching:
» 1. Employs the complete signature to parse input data
– Even if normal data matches only a small prefix portion
– Full signature => large DFA
» 2. Only one active state of execution and no memory of previous matches
– Combinations of partial matches require new DFA states
» 3. Inability to count certain sub-expressions
– E.g. a{1024} will require 1024 DFA states
We aim to address each of these problems in the proposed research.
Addressing the First Problem
Divide the processing into a fast path and a slow path.
Split each signature into a prefix and a suffix
» Employ signature prefixes in the fast path
» Upon a match in the fast path, trigger the slow path
» Appropriate splitting can maintain a low triggering rate
Benefits:
» The fast path can employ a composite DFA for all prefixes
– Due to small prefixes, the composite DFA remains small
– Higher parsing rate
» The slow path uses a separate DFA for each signature
– No state explosion in the slow path
– Due to the low triggering rate, the slow path will not become a bottleneck
» Reduces per-flow state
– Fast path uses a composite DFA, one active state per flow
Fast and Slow Path Processing
Here we assume that an ε fraction of the flows are diverted to the slow path. The fast path stores one DFA state per flow; the slow path may store multiple active states.
[Figure: fast-path automaton with per-flow state memory, processing B bits/sec, feeding slow-path automata with their own state memory]
Splitting Reg-exes
Splitting can be performed based upon data traces. Assign a probability to each NFA state and make a cut so that the cumulative probability of the slow path is low.
[Figure: NFA for the three signatures, annotated with per-state match probabilities (e.g. 0.25, 0.2, 0.01, 0.001 along r1), and a cut separating the fast-path automaton from the slow-path automata]
r1 = .*[gh]d[^g]*ge
r2 = .*fag[^i]*i[^j]*j
r3 = .*a[gh]i[^l]*[ae]c
Cumulative probability of the slow path = 0.05
Splitting Reg-exes
[Figure: the same NFA and cut as on the previous slide]
The slow path comprises three separate DFAs, one for each signature. The fast path contains a composite DFA (14 states) for the prefixes:
p1 = .*[gh]d[^g]*g
p2 = .*fa
p3 = .*a[gh]i
[Figure: the composite fast-path DFA; notice the start state]
Protection against DoS Attacks
An attacker can attack such a system by sending data that matches the prefixes more often than provisioned
» The slow path will become the bottleneck
Solution: look at the history and determine whether a flow is an attack flow
» Compute an anomaly index: a weighted moving average of the number of times a flow has triggered the slow path
» If a flow has a high anomaly index, send it to a low-rate queue
[Figure: fast-path automaton at B pkts/sec with per-flow anomaly counters, a HoL buffer, and slow-path sleep status in front of the slow-path automata]
Initial Simulation Results
[Figure: three time series over about 250 seconds, under no, moderate, and extreme overloading: throughput without DoS protection, slow-path load against the slow path's threshold, and per-flow throughput with DoS protection]
Addressing the Second Problem
NFA: compact, but O(n) active states. DFA: one active state, but state explosion.
» How to avoid state explosion while also keeping the per-flow active state information small?
Propose a novel machine called a History-based Finite Automaton, or H-FA
» Augment a DFA with a history buffer
» Transitions are taken after looking at the history buffer contents
» During certain transitions, items are inserted into or removed from the history buffer
Claim: a small history buffer is sufficient to avoid state explosion while keeping a single active state.
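The claim can be made concrete with a minimal sketch for one hypothetical signature, ab[^a]*c: a single history bit replaces the DFA states that would otherwise multiply to remember "saw ab":

```python
def hfa_match(data):
    # H-FA-style sketch for "ab[^a]*c" (illustrative, not the proposed
    # construction): one history flag records "saw ab, and no 'a' since".
    flag = False          # the history buffer: a single bit here
    prev_a = False        # base-automaton state: last char was 'a'
    for ch in data:
        if flag and ch == "c":
            return True   # "ab[^a]*c" matched
        if ch == "a":
            flag = False  # 'a' violates [^a]*: remove the flag from history
            prev_a = True
        else:
            if ch == "b" and prev_a:
                flag = True   # saw "ab": insert the flag into history
            prev_a = False
    return False
```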
Example of H-FA Construction
NFA: ab[^a]*c; def
[Figure: the NFA (states 0-6) and the corresponding DFA]
NFA state 2 is present in 4 DFA states. If we remove NFA state 2 from these DFA states, we will have just 6 states.
H-FA
[Figure: the resulting 6-state machine with conditional transitions such as "b, flag<=1", "a, flag<=0", and "c, if flag=1, flag<=0"]
This new machine uses a history flag, in addition to its transitions, to make moves.
H-FA
[Figure: the same H-FA processing the input "c d a b c"; the flag is set when NFA state 2 becomes active after "ab", and reset when an 'a' or the matching 'c' is seen]
H-FA
In general, if we maintain a flag for each NFA state that represents a Kleene closure, we can avoid any state explosion.
k closures will require at most k bits in the history buffer.
There are some challenges associated with the efficient implementation of conditional transitions
» We plan to work on these in the proposed research
Addressing the Third Problem
Example: ab[^a]{1024}cdef
[Figure: the earlier H-FA with the history flag replaced by a counter]
Replace the flag by a counter:
» Replace the flag=1 condition with ctr=1024
» Replace the flag=0 condition with ctr=0
» Increment ctr if ctr>0; reset when ctr reaches 1024
One of the primary goals of this research is to enable efficient implementation of counter conditions.
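A counter-based sketch (an H-cFA in miniature) for a signature of the form ab[^a]{n}c, with a small n so the behavior is easy to trace; this is an illustration of the idea, not the proposed hardware mechanism:

```python
def hcfa_match(data, n):
    # One counter in the history buffer replaces n chained DFA states
    # for the length restriction [^a]{n}.
    ctr = None            # None: "ab" not yet seen; else count of [^a] chars
    prev_a = False
    for ch in data:
        if ctr is not None and ctr == n and ch == "c":
            return True   # exactly n non-'a' chars after "ab", then 'c'
        if ch == "a":
            ctr = None    # 'a' violates [^a]{n}: reset the counter
            prev_a = True
        else:
            if ctr is not None and ctr < n:
                ctr += 1
            elif ctr is not None:
                ctr = None    # overshot n non-'a' chars without the 'c'
            if ch == "b" and prev_a and ctr is None:
                ctr = 0       # saw "ab": start counting
            prev_a = False
    return False
```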
Early Results
DFA versus composite H-FA / H-cFA:

Source       | closures, length restrictions | # automata | DFA states | flags | counters | H-FA states | max trans/char | total transitions | % space reduction with H-FA | parsing rate speedup
Cisco64      | 14, 1 | 1 | 132784 |  6 | 0 | 3597 | 2 | 1215450 | 94.69 | -
Cisco64      | 14, 1 | 1 | 132784 | 13 | 0 | 1861 | 8 |  682718 | 96.77 | -
Cisco68      | 19, 1 | 1 | 328664 | 17 | 0 | 2956 | 8 | 1337293 | 97.03 | -
Snort rule 1 |  6, 6 | 3 |  62589 |  5 | 6 |  583 | 8 |  238107 | 97.40 | 3x
Snort rule 2 |  1, 2 | 1 |  12703 |  1 | 2 |   71 | 2 |   27498 | 98.58 | -
Snort rule 3 |  5, 1 | 2 |   4737 |  5 | 1 |  116 | 4 |   46124 | 93.48 | 2x
Linux70      | 11, 0 | 2 |  20662 |  9 | 0 | 1304 | 8 |  546378 | 81.63 | 2x
Overview of the Presentation
Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results
IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA
Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing
Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist
IP Address Lookup
Routing tables at router input ports contain (prefix, next hop) pairs. The address in the packet is compared to the stored prefixes, starting at the left. The prefix that matches the largest number of address bits is the desired match. The packet is forwarded to the specified next hop.
Routing table (prefix, next hop): 1* 5; 00* 3; 01* 5; 0* 7; 001* 2; 011* 3; 1011* 4
address: 0110 0100 1000
Address Lookup Using Tries
Prefixes are stored in “alphabetical order” in a tree, and are “spelled” out by following a path from the top.
» Green dots mark prefix ends
To find the best prefix, spell out the address in the tree. The last green dot marks the longest matching prefix.
[Figure: binary trie over the routing table above; for address 0110 0100 1000 the longest matching prefix is 011*, next hop 3]
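The lookup just described can be sketched as a small binary trie, using the routing table from the slide (nested dicts stand in for trie nodes):

```python
# Minimal binary-trie longest-prefix match.
def insert(trie, prefix, next_hop):
    node = trie
    for bit in prefix:
        node = node.setdefault(bit, {})
    node["hop"] = next_hop            # a "green dot": this node ends a prefix

def lookup(trie, address):
    node, best = trie, None
    for bit in address:               # spell the address out along the trie
        if bit not in node:
            break
        node = node[bit]
        best = node.get("hop", best)  # remember the last prefix end seen
    return best

trie = {}
for p, hop in [("1", 5), ("00", 3), ("01", 5), ("0", 7),
               ("001", 2), ("011", 3), ("1011", 4)]:
    insert(trie, p, hop)
```

For the slide's address 0110 0100 1000, the walk passes prefix ends 0* (7), 01* (5) and 011* (3), so the lookup returns next hop 3.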
Pipelined Trie-based IP-lookup
Tree data structure, prefixes in leaves (leaf pushing). Process the IP address level by level to find the longest match. Each level sits in a different pipeline stage, which overlaps multiple packets.
[Figure: a leaf-pushed trie with prefixes P1-P7, e.g. P4 = 10010*]
Stages are of different sizes:
» Requires more memory
» The largest stage becomes the bottleneck
Circular Pipeline, ANCS’06
Use a circular pipeline and allow requests to enter and exit at any stage.
Mapping:
» Divide the trie into multiple sub-tries
» Map each sub-trie with its root starting at a different stage
Mapping in Circular Pipeline
[Figure: a direct index table handles the first 2 bits of the address (00* enters at pipeline stage 1, 01* at stage 2, 10* has no match, 11* at stage 3); the trie for prefixes P1-P8 is divided into three sub-tries whose roots are mapped to different stages of a 4-stage circular pipeline]
Example table: 00* P1; 000* P2; 010* P3; 01001* P4; 01011* P5; 011* P6; 110* P7; 111* P8
Circular Pipeline
Benefits:
» Uniform stage sizes
» Less memory: no over-provisioning is needed in the face of arbitrary trie shapes
» Higher throughput
New Direction: HEXA
HEXA (History-based Encoding, eXecution and Addressing)
» Challenges the assumption that graph structures must store log2 n-bit pointers to identify successor nodes
If the labels of the path leading to every node are unique, then these labels can be used to identify the node
» In tries, every node has a unique path starting at the root node
» Thus, the labels along the path become the identifier of the node
» Note that these labels need not be explicitly stored
Traditional Implementation
Prefixes: 1* P1; 00* P2; 11* P3; 011* P4; 0100* P5
[Figure: the corresponding trie with nodes numbered 1-9]
Node table (addr: prefix flag, left child, right child):
1: 0, 2, 3
2: 0, 4, 5
3: 1, NULL, 6
4: 1, NULL, NULL
5: 0, 7, 8
6: 1, NULL, NULL
7: 0, 9, NULL
8: 1, NULL, NULL
9: 1, NULL, NULL
There are nine nodes, so we need 4-bit node identifiers. Total memory = 9 x 9 bits.
HEXA-based Implementation
[Figure: the same trie and prefixes as on the previous slide]
Define the HEXA identifier of a node as the path that leads to it from the root:
1. -
2. 0
3. 1
4. 00
5. 01
6. 11
7. 010
8. 011
9. 0100
Notice that these identifiers are unique. Thus, they can potentially be mapped to unique memory addresses.
HEXA-based Implementation
Use hashing to map the HEXA identifiers to memory addresses. If we have a minimal perfect hash function f (a function that maps elements to unique locations), e.g.
f(-) = 4, f(0) = 7, f(1) = 9, f(00) = 2, f(01) = 8, f(11) = 1, f(010) = 5, f(011) = 3, f(0100) = 6
then we can store the trie as shown below, using only 3 bits per node in the fast path:
Addr: fast path (prefix flag, left, right), prefix
1: 1,0,0 P3
2: 1,0,0 P2
3: 1,0,0 P4
4: 0,1,1
5: 0,1,0
6: 1,0,0 P5
7: 0,1,1
8: 0,1,1
9: 1,0,1 P1
Devising a One-to-one Mapping
Finding a minimal perfect hash function is difficult
» A one-to-one mapping is essential for HEXA to work
Use discriminator bits
» Append c bits, which we are free to modify, to every HEXA identifier
» Thus a node has 2^c choices of identifier
» Notice that we need to store these c bits, so slightly more than 3 bits per node are needed
With multiple choices of HEXA identifier per node, we can reduce the problem to a bipartite graph matching problem
» We need to find a perfect matching in the graph to map nodes to unique memory locations
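A sketch of the discriminator-bit idea for the nine-node trie above: c = 2 discriminator bits give each node four candidate memory locations (zlib.crc32 is a stand-in for the real hash), and an augmenting-path matching assigns unique locations. The table size here is modestly over-provisioned, an assumption for this toy:

```python
import zlib

paths = ["-", "0", "1", "00", "01", "11", "010", "011", "0100"]
TABLE = 16                       # slightly over-provisioned memory

def candidates(path):
    # Prepend each 2-bit discriminator to the HEXA identifier and hash.
    return [zlib.crc32(f"{d:02b}{path}".encode()) % TABLE for d in range(4)]

owner = {}                       # memory location -> path

def assign(path, visited):
    # Augmenting-path step: take a free location or evict a movable holder.
    for loc in candidates(path):
        if loc not in visited:
            visited.add(loc)
            if loc not in owner or assign(owner[loc], visited):
                owner[loc] = path
                return True
    return False

ok = all(assign(p, set()) for p in paths)
```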
Devising a One-to-one Mapping
[Figure: the nine HEXA identifiers, each with four discriminator-bit choices (e.g. "00 010, 01 010, 10 010, 11 010"), the memory locations each choice hashes to, and a perfect matching in the resulting bipartite graph]
Initial Results
Our initial evaluation suggests that 2-bit discriminators are enough to find a perfect matching
» Thus 2 bits per node are enough instead of log2 n bits
[Figure: number of HEXA identifier choices needed versus number of trie nodes (100 to 1M), with no, 1%, 3% and 10% memory over-provisioning]
Initial Results
Memory comparison to Eatherton’s trie.
[Figure: fast-path trie memory (MB) versus stride (1-6), with and without HEXA]
In future:
» Complete evaluation of HEXA-based IP lookup: throughput, die size and power estimates
» Extend HEXA to string matching and finite automata
Overview of the Presentation
Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results
IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA
Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing
Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist
Hash Tables
Suppose our hash function gave us the following values:
» hash("apple") = 5
» hash("watermelon") = 3
» hash("grapes") = 8
» hash("cantaloupe") = 7
» hash("kiwi") = 0
» hash("strawberry") = 9
» hash("mango") = 6
» hash("banana") = 2
» hash("honeydew") = 6
This is called a collision. Now what?
[Figure: a 10-bucket table (0-9) already holding kiwi, banana, watermelon, apple, mango, cantaloupe, grapes and strawberry]
Collision Resolution Policies
Linear Probing
» Successively search for the first empty subsequent table entry
Linear Chaining
» Link all collided entries at any bucket as a linked list
Double Hashing
» Uses a second hash function to successively index the table
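The first policy can be sketched with the fruit example above; "honeydew" collides with "mango" at bucket 6 and probes forward (wrapping around) until it finds a free slot:

```python
# Toy open-addressing table with linear probing, using the slide's
# fruit keys and hash values.
SIZE = 10
hash_of = {"apple": 5, "watermelon": 3, "grapes": 8, "cantaloupe": 7,
           "kiwi": 0, "strawberry": 9, "mango": 6, "banana": 2,
           "honeydew": 6}          # honeydew collides with mango
table = [None] * SIZE

def insert(key):
    i = hash_of[key]
    while table[i] is not None:    # probe successive entries
        i = (i + 1) % SIZE
    table[i] = key
    return i

for k in hash_of:                  # inserts in the listed order
    insert(k)
```

Inserted last, "honeydew" probes 6, 7, 8, 9, 0 (all full) before landing in bucket 1; a five-slot probe on a nearly full table previews the worst-case behavior discussed on the next slide.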
Performance Analysis
Average performance is O(1). However, worst-case performance is O(n). In fact, the likelihood that a key is at a distance > 1 is pretty high.
[Figure: probability that key distance exceeds 1 or 2 versus load m/n (10-100%); keys at distance 2 take twice the time to be probed, those at distance 3 three times]
There is a pretty high probability that throughput is two or three times lower than the peak throughput.
Segmented Hashing, ANCS’05
Uses the power of multiple choices
» proposed earlier by Azar et al.
An N-way segmented hash
» Logically divides the hash table array into N equal segments
» Maps each incoming key onto one bucket from each segment
» Picks the bucket which is either empty or has the minimum number of keys
[Figure: a 4-way segmented hash table; key ki maps to one candidate bucket per segment and is inserted into the least-loaded one]
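The insertion rule can be sketched directly; zlib.crc32 is a stand-in for the per-segment hash functions, and the sizes are arbitrary toy values:

```python
import zlib

# N-way segmented hash: one candidate bucket per segment,
# insert into the least-loaded candidate.
N, BUCKETS = 4, 8
segments = [[[] for _ in range(BUCKETS)] for _ in range(N)]

def bucket(seg, key):
    return zlib.crc32(f"{seg}:{key}".encode()) % BUCKETS

def insert(key):
    # Pick the segment whose candidate bucket currently holds the fewest keys.
    seg = min(range(N), key=lambda s: len(segments[s][bucket(s, key)]))
    segments[seg][bucket(seg, key)].append(key)

def lookup(key):
    # Probe the single candidate bucket in every segment.
    return any(key in segments[s][bucket(s, key)] for s in range(N))

for k in range(20):
    insert(f"flow-{k}")
```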
Segmented Hash Performance
More segments improve the probabilistic performance
» With 64 segments, the probability that a key is inserted at distance > 2 is nearly zero even at 100% load
» The improvement in average-case performance is still modest
[Figure: probability of key distance > 1 and > 2 versus load m/n (10-100%), for 1, 4, 8, 16, 32 and 64 segments]
Adding per-Segment Filters
[Figure: each segment carries a Bloom filter of mb bits; key ki is hashed by h1..hk and can go to any of the 3 segments whose candidate buckets can accept it]
We can select any of the eligible segments and insert the key into the corresponding filter.
Selective Filter Insertion Algorithm
[Figure: the same structure; the key is inserted into segment 4, since fewer bits are set there]
Insert the key into the segment where the fewest new Bloom filter bits would be set; fewer set bits mean a lower false-positive rate. With more segments (and hence more choices), the algorithm sets far fewer bits in the Bloom filters.
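A compact sketch of selective filter insertion; the filter size, hash count and crc32-derived hashes are toy assumptions standing in for h1..hk:

```python
import zlib

# Each segment keeps a small Bloom filter; a key goes to the segment
# where its hash bits would set the fewest new filter bits.
N, MB, K = 4, 64, 3
filters = [[0] * MB for _ in range(N)]
members = [set() for _ in range(N)]      # ground truth, for the demo only

def bits(key):
    return [zlib.crc32(f"{i}:{key}".encode()) % MB for i in range(K)]

def insert(key):
    # Count how many of this key's bits are still 0 in each filter.
    seg = min(range(N),
              key=lambda s: sum(1 - filters[s][b] for b in bits(key)))
    for b in bits(key):
        filters[seg][b] = 1
    members[seg].add(key)

def maybe_in(seg, key):
    # Bloom filter query: no false negatives, occasional false positives.
    return all(filters[seg][b] for b in bits(key))

for k in range(12):
    insert(f"key-{k}")
```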
Problem with Segmented Hash
The Bloom filter size is proportional to the total number of elements. An O(1) lookup can be maintained even if we omit the Bloom filter of one segment
» With many segments of equal size, this omission will not lead to much reduction in Bloom filter size
An alternative is to use segments of different sizes and omit the Bloom filter of the largest segment
» If the largest segment is, say, 90% of the total memory, then this results in a 90% reduction in Bloom filter size
» Peacock hashing utilizes this property
Peacock Hashing
[Figure: keys k1-k7 from the universe U of keys mapped by hash functions h1-h5 into segments of geometrically increasing size]
Size of the 1st segment = 1; size of the 2nd segment = c; size of the ith segment = c x size of the (i-1)st segment. No element is discarded until the first segment is filled.
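The geometry behind the earlier 90% claim is quick to check; the segment count here is an arbitrary example:

```python
# With segment sizes growing by a factor c, the largest segment
# (whose Bloom filter is omitted) holds roughly (c-1)/c of all memory.
c, num_segments = 10, 5
sizes = [c ** i for i in range(num_segments)]    # 1, 10, 100, 1000, 10000
largest_fraction = sizes[-1] / sum(sizes)        # ~0.90 for c = 10
```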
Peacock Hash
Use a Bloom filter for every segment but the largest
» Thus, for c = 10, the Bloom filters are 10x smaller
Lookup is straightforward:
» First consult all the Bloom filters
» If none of them indicates membership, look up the largest segment
» Else look up the segments that indicate membership
In order to enable deletes we require counting Bloom filters, but the counters can be kept in the slow path.
Deletes, however, lead to imbalance in the loading.
Peacock Hash
A series of deletes and inserts may lead to overflow of the smaller segments.
[Figure: discard rate (%) over simulation time (sampling interval 1000) for segments 1-6; discards rise after the second phase begins]
Peacock Hash
Following every delete we perform a re-balancing, i.e. we search the smaller segments and move elements to a larger segment if possible.
[Figure: discard rate (%) over simulation time (sampling interval 1000) for segments 1-6, with re-balancing enabled]
Issues and Future Directions
It is not clear how to perform re-balancing efficiently
» In the previous simulation, we use a brute-force approach and search the entire segment, leading to an O(n) re-balancing cost
Complicating factors:
» Collision lengths higher than 1 in some segments
» Double hashing collision policy
» Use of 2-ary hashing may improve the efficiency, but will again complicate the re-balancing
Future research objectives:
» Develop an efficient re-balancing algorithm
» Develop Bloom filters which better utilize the power of multiple choices
» Extend the scheme to memory segments with different bandwidth and access latency
Overview of the Presentation
Packet payload inspection
» Previous work: D2FA and CD2FA
» New ideas to implement regular expressions
» Initial results
IP Lookup
» Tries and pipelined tries
» Previous work: CAMP
» New direction: HEXA
Hashing used for packet header processing
» Why do we need better hashing?
» Previous work: Segmented Hash
» New direction: Peacock Hashing
Packet buffering and queuing
» Previous work: multichannel packet buffer, aggregated buffer
» New direction: DRAM-based buffer, NP-based queuing assist
Packet Buffering and Queuing
The first objective is to extend the multichannel packet buffer architecture to DRAM memories. We also plan to consider memories with different sizes, bandwidths and access latencies.
» Extension of
– Sailesh Kumar, Patrick Crowley, and Jonathan Turner, "Design of Randomized Multichannel Packet Storage for High Performance Routers", Proceedings of IEEE Symposium on High Performance Interconnects (HotI-13), Stanford, August 17-19, 2005.
We will also work on an NP-specific queuing hardware assist
» Extension of
– Sailesh Kumar, John Maschmeyer, and Patrick Crowley, "Queuing Cache: Exploiting Locality to Ameliorate Packet Queue Contention and Serialization", Proceedings of ACM International Conference on Computing Frontiers (ICCF), Ischia, Italy, May 2-5, 2006.
The proposed research is expected to take one year.
Acknowledgments
» Jon Turner
» Patrick Crowley
» Michela Becchi
» Sarang Dharmapurikar
» John Lockwood
» Roger Chamberlain
» Robert Morley
» Balakrishnan Chandrasekaran
» Michael Mitzenmacher, Harvard Univ.
» George Varghese, UCSD
» Will Eatherton, Cisco
» John Williams, Cisco
Questions???