

Scalable Packet Classification for Datacenter Networks

Pi-Chung Wang

Abstract—The key challenge to a datacenter network is its scalability to handle many customers and their applications. In a datacenter network, packet classification plays an important role in supporting various network services. Previous algorithms store classification rules with the same length combinations in a hash table to simplify the search procedure. The search performance of hash-based algorithms is tied to the number of hash tables. To achieve fast and scalable packet classification, we propose an algorithm, encoded rule expansion, to transform rules into an equivalent set of rules with fewer distinct length combinations, without affecting the classification results. The new algorithm can minimize the storage penalty of transformation and achieve a short search time. In addition, the scheme supports fast incremental updates. Our simulation results show that more than 90% of hash tables can be eliminated. The reduction of length combinations leads to an improvement in the speed of packet classification by an order of magnitude. The results also show that a software implementation of our scheme without using any hardware parallelism can support up to one thousand customer VLANs and one million rules, where each rule consumes less than 60 bytes and each packet classification can be accomplished in fewer than 50 memory accesses.

Index Terms—Packet classification, datacenter network, scalability, router architectures, packet forwarding, firewalls, VLANs.

I. INTRODUCTION

DATACENTERS house computing resources in controlled environments with centralized management. These computing resources include various servers, storage subsystems, and the network infrastructure, including IP or storage-area networks. Applications of datacenters range from handling the core business and operational data of an organization to external e-commerce and business-to-business applications. While some larger organizations still manage internal data centers, many organizations have outsourced their computing infrastructures to off-premises data centers and third-party cloud computing providers.

To address the emerging challenges of datacenters, a datacenter network that interconnects the computing resources should be scalable, low-latency, high-throughput and secure. The network topology in a datacenter is typically organized as a three-layer hierarchy (Fig. 1) to interconnect tens of thousands of computing devices running thousands of applications and services [1].

Manuscript received January 15, 2013; revised July 1, 2013. This work is supported in part by the National Science Council under Grant No. NSC100-2628-E-005-007-MY3.

P.-C. Wang is now with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan 402, ROC (e-mail: [email protected]).

Digital Object Identifier 10.1109/JSAC.2014.140112.


Fig. 1. Three-layer data center network topology [1].

With the diverse applications within data centers, the switching layers of a datacenter network are collapsed by consolidating layer-2 networks with 10-gigabit Ethernet (10GbE). In addition to the basic services for network connectivity, a datacenter network also implements quality of service, load balancing, access control lists, firewalls, and network intrusion detection/prevention, most of which are located in the aggregation layer. These functions fulfill infrastructure and security services with application awareness. Since a datacenter network could be shared by multiple tenants, it is reasonable to provide network virtualization by slicing network resources. Last but not least, a datacenter network should be energy efficient to lower electricity expense. A flexible network device that can consolidate all network services is thus desired. Device consolidation also reduces packet latency.

Packet classification is an elementary operation for fulfilling the services of datacenter networks. It categorizes disordered packets into meaningful traffic flows of various Internet applications based on predefined rules (or policies) [2]. Each rule defines the actions to be performed on the matching packets. For some services, such as firewalls and load balancing, a packet classifier only applies the action of the first matching rule. For deep packet inspection, a packet classifier returns all actions of the matching rules [3], [4]. To consolidate multiple services in a network device, a packet classifier must return all the required actions of the different services.

In the last decade, much effort has gone into characterizing rules, which has resulted in a number of RAM-based algorithmic solutions. The speed performance of packet classification also benefits from hardware parallelism based on ternary content addressable memories (TCAMs). TCAMs have superior performance for first-match rules, but they are not scalable due to their limited space [5]. They also suffer from degraded speed or storage performance when supporting multi-match¹ packet classification [3], [6], and they consume more energy.


In this paper, we present a scalable RAM-based algorithmic solution to meet the requirements of virtualization and consolidation in datacenter networks. Network virtualization is achieved by employing VLANs with rule-based path isolation [7]. Our algorithm can handle rules with VLAN identifiers (VIDs) without degrading performance, including in scenarios with dual VIDs, one for the service provider and another for customers. The algorithm can provide a high degree of consolidation by supporting more than one million rules with thousands of VLANs. Our main algorithm, encoded rule expansion, aims at revising the prefix length combinations of the rules with low storage penalty. By combining the concepts of rule encoding and rule expansion, the algorithm reduces rule replication to drastically improve storage efficiency. We also present a greedy algorithm for minimizing the number of hash tables. Our experimental results show that the speed performance is logarithmically proportional to the number of rules.

The rest of the paper is organized as follows. Section II describes related work. Section III motivates our idea to improve the performance of hash-based packet classification. In Section IV, we introduce our encoded rule expansion algorithm, which serves to reduce the cost of revising prefix length combinations. Section V presents a greedy algorithm that employs encoded rule expansion to reduce the number of hash tables with minimal cost. The search and update procedures are introduced in Section VI. In Section VII, an experimental setup is provided and a detailed evaluation of the proposed scheme is performed. Finally, Section VIII states our conclusion.

II. RELATED WORK

The rules for packet classification consist of a set of fields and an action. Each field, in turn, corresponds to one field of the packet header. The value in each field could be a variable-length prefix, a range, an explicit value or a wildcard. The most common fields include a source/destination IP address prefix, a source/destination port range of the transport protocol and a protocol type in a packet header. Formally, a rule R with d fields is defined as R = (f1, f2, ..., fd). While performing packet classification, a packet header P is said to match a particular rule R if, for all i, the ith field of the header satisfies fi. Both hardware and software based solutions have been proposed to efficiently yield the matching rules.

Hardware proposals usually use parallel lookups to accelerate the search procedure, but hardware resources, such as ternary content addressable memory (TCAM) devices, may limit the size of supported rule sets [2]. Numerous TCAM-based algorithms have been proposed to address the problems of range representation, power consumption and storage requirement. While all three problems are related, most of the existing algorithms reduce TCAM entries. A smaller TCAM also has better energy efficiency and shorter search latency.

¹In [4], a different term, all-match, is used.

The existing algorithms can be categorized into three types: power saving [8]–[10], range encoding [10]–[16], and minimization [17]–[20]. The algorithms for range encoding minimize the number of TCAM entries needed for range representation. In [21], the authors demonstrated that the maximum cost of range expansion is W, where W is the width of the range field. The problem of range expansion can be alleviated by range encoding. TCAM minimization algorithms are designed for single-match packet classification, which means only the first matching entry is reported. Some low-priority rules are useless because they are never reported by the TCAM for any packet. Several TCAM minimization algorithms have been proposed to remove these redundant entries.

To support multi-match packet classification, Yu et al. [3] yielded all intersections between the rules to keep track of all matching rules. This scheme is fast, but requires a large number of TCAM entries. SSA [22] reduces the number of TCAM entries for geometric intersection based on the observation that the number of intersections can be significantly reduced by splitting a rule database into several subsets. Lakshminarayanan et al. [6] stored a discriminator with each rule to prune previously matched rules; each multi-match packet classification then requires multiple TCAM accesses. As a result, the TCAM-based solutions for multi-match packet classification either worsen the storage overhead or sacrifice the speed performance.

Software algorithms have better scalability; however, their data structures for storing rules may significantly affect the performance. These data structures include bit vectors [23], [24], decision trees [5], [25]–[29], crossproducting [30], [31], and hash tables [32], [33]. The algorithms based on bit vectors and crossproducting need large memory space. The decision tree has been regarded as an efficient data structure for packet classification. The existing decision-tree algorithms may use different approaches to partitioning a set of rules, such as HiCuts [25], Modular Packet Classification [26], HyperCuts [27] and CubeCuts [28]. For example, HiCuts selects one field to divide the address space into equal partitions, and HyperCuts may use more than one field for space partitioning. The rules that occupy more than one of the divided address partitions are replicated. Thus, the cuts chosen for space partitioning affect the degree of rule replication. To alleviate rule replication, BSOL [29], EffiCuts [5] and the algorithm in [34] employ multiple decision trees to store rules. The decision trees of [34] also support the 12-tuple rules of OpenFlow [35]. Tuple space search (TSS) [32] stores rules in hash tables according to their prefix length combinations; the rules of a group are stored in one hash table. The matching rules can be yielded by probing all hash tables. Several algorithms have been proposed to reduce the number of accessed hash tables. For example, Pruned Tuple Space Search (PTSS) correlates each hash table with the per-field specifications in its rules so that the hash tables without all matching field specifications can be excluded [32]. For two-field rules, pre-computation information and markers are embedded into the hash tables to improve search performance [32], [36]. In [33], a rule-encoding scheme based on prefix nesting (or prefix containment [24]) is presented to reduce the number of hash tables. A recent work generates hash tables for the rounded-down IP address prefixes [37].


Each hash entry stores the original prefixes and has a pointer to the other rule fields for further comparison.

III. MOTIVATION

The goal of this work is to improve the search performance of hash-based packet classification. As mentioned above, TSS categorizes rules according to their prefix length combinations; therefore, the search procedure of packet classification involves producing all matching entries from a set of hash tables. The existing algorithms use various techniques to improve TSS search performance. These algorithms either use extra entries with pre-computation information to limit the hash tables that must be accessed [32], [36], or encode rules to change the elements in a hash table [33]. None of the existing algorithms is capable of controlling the number of hash tables, which is the main bottleneck for the search performance of hash-based packet classification². Accordingly, we fill this gap by proposing several algorithms to efficiently control the number of hash tables.

The main idea of our scheme consists of two parts. In the first part, a storage-efficient algorithm for expanding a rule to a designated length combination is presented. This approach is based on rule encoding to minimize the cost of per-field prefix expansion. Since a generic five-field rule contains two arbitrary port ranges, we also describe an efficient approach to transforming a range into a single prefix. The second part minimizes the cost of reducing the number of length combinations of a rule database. A greedy algorithm is presented to properly select length combinations for expansion.

IV. ENCODED RULE EXPANSION

In this section, we introduce the proposed algorithm, encoded rule expansion. We first present our approach to encoding a rule with range fields. Next, the expansion algorithm for the encoded rules is presented.

A. Encoding Rules with Non-prefix Fields

The approach to encoding rules with non-prefix fields is extended from [33]. To begin with, we provide brief background information on rule encoding in tuple space search and then point out the need for additional work. In [33], the number of hash tables is reduced by encoding rules based on prefix nesting (or prefix containment [24]), because no IP address prefix contains more than five nesting prefixes in the existing rule databases [24], [30]. Consequently, five distinct lengths are enough to reflect the hierarchy of the prefixes specified in the rules, rather than the maximum length of each field (e.g., 32 bits for IP addresses). This rule encoding approach only supports prefix fields, such as source and destination IP prefixes. It is possible to construct a two-dimensional tuple space by using only the two prefix fields to avoid the cost of range-to-prefix conversion. However, administrators usually apply multiple policies to a subnetwork pair [2], [30].

²Although rule encoding can reduce the number of hash tables, the number of hash tables is determined by the number of nested prefixes.

TABLE I
AN EXAMPLE WITH TWELVE RULES ON FIVE FIELDS.

Rule  f1    f2    f3       f4       f5   Action
R0    000∗  111∗  [10:10]  ∗        UDP  act0
R1    000∗  111∗  [01:01]  [10:10]  UDP  act0
R2    000∗  10∗   ∗        [10:10]  TCP  act1
R3    000∗  10∗   [00:10]  [01:01]  TCP  act2
R4    000∗  10∗   [10:10]  [11:11]  TCP  act1
R5    0∗    111∗  [10:10]  [01:01]  UDP  act0
R6    0∗    111∗  [10:10]  [10:10]  UDP  act0
R7    0∗    1∗    ∗        ∗        TCP  act2
R8    ∗     01∗   [00:10]  ∗        TCP  act2
R9    ∗     0∗    ∗        [01:01]  UDP  act0
R10   ∗     ∗     ∗        ∗        UDP  act3
R11   ∗     ∗     ∗        [01:11]  TCP  act4

Rules that share the same source and destination prefixes would then be stored in the same entry of a hash table, resulting in severe hash collisions.

We convert a range into its longest enclosure prefix. Our approach is premised on a restriction that enforces a comparison with the content of the matching hash-table entry to avoid mismatches when accessing a hash table. Therefore, it is allowable to use less precise information to generate a hash index. We note that a similar technique has also been used in [38] to minimize the cost of range representation in TCAM; however, that technique only supports disjoint ranges due to the single-match behavior of TCAM. In contrast, hash-based packet classification has no such limitation.

The longest enclosure prefix of a port range consists of the common leading bits of the start and end addresses of the range. For example, the typical decimal range specification [1024:65535] is enclosed by the prefix 〈∗〉. Another decimal range specification, [20:24], is represented by the prefix 〈000000000001∗〉. The longest enclosure prefix is then used to define the tuple and to generate the hash-table index where the original rule is stored. As each range is transformed into a prefix covering a wider range, a search for a specific range must also match its enclosure prefix to ensure that there is no false negative. When a hash-table entry is accessed, the original rule is compared to determine whether there is a matching rule.
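As a concrete illustration, the following Python sketch (ours, not the paper's implementation; the field width and the helper name longest_enclosure_prefix are assumptions) derives the longest enclosure prefix of a range by keeping the leading bits shared by its start and end values:

def longest_enclosure_prefix(start: int, end: int, width: int = 16) -> str:
    """Return the longest prefix (as a bit string ending in '*')
    that encloses the inclusive range [start, end]."""
    bits = []
    for i in range(width - 1, -1, -1):      # scan from the most significant bit
        s_bit = (start >> i) & 1
        e_bit = (end >> i) & 1
        if s_bit != e_bit:                   # first disagreement ends the common prefix
            break
        bits.append(str(s_bit))
    return "".join(bits) + "*"

# Examples from the discussion above:
print(longest_enclosure_prefix(1024, 65535))  # '*'            (no common leading bits)
print(longest_enclosure_prefix(20, 24))       # '000000000001*'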

Let us consider an example with twelve rules on five fields (listed in Table I), where the values of the prefix and range fields are represented in binary form. Each rule has two range fields, f3 and f4, which must be converted to prefixes. These rules would be transformed into fifteen rules with thirteen length combinations by using the basic range-to-prefix conversion. After our approach is applied, the number of converted rules remains the same as the number of original rules, as listed in Table II, and the number of length combinations is reduced to nine.

Two or more rules which differ only in their range fields may result in duplicate rules if their ranges are converted into the same enclosure prefixes. This problem can be eliminated by converting a range into multiple prefixes if it causes too many duplicated rules.

In summary, our approach ensures that each rule with range fields is transformed into exactly one rule with prefix fields only, whereas the basic approach results in an average rule expansion factor of 2.32 [2].


Fig. 2. Per-field Binary Tries for the Rules in Table II.

TABLE II
CONVERTED RULES USING THE PROPOSED RANGE-TO-PREFIX CONVERSION.

Rule  f1    f2    f3   f4   f5   Action
R0    000∗  111∗  10∗  ∗    UDP  act0
R1    000∗  111∗  01∗  10∗  UDP  act0
R2    000∗  10∗   ∗    10∗  TCP  act1
R3    000∗  10∗   ∗    01∗  TCP  act2
R4    000∗  10∗   10∗  11∗  TCP  act1
R5    0∗    111∗  10∗  01∗  UDP  act0
R6    0∗    111∗  10∗  10∗  UDP  act0
R7    0∗    1∗    ∗    ∗    TCP  act2
R8    ∗     01∗   ∗    ∗    TCP  act2
R9    ∗     0∗    ∗    01∗  UDP  act0
R10   ∗     ∗     ∗    ∗    UDP  act3
R11   ∗     ∗     ∗    ∗    TCP  act4

Our approach thus provides a storage-efficient solution for hash-based algorithms that also results in fewer length combinations, thereby enhancing the search performance of hash-based packet classification.

Hereafter, all discussions are based on the assumption that each rule has been converted into the pure-prefix form before any further processing takes place.

Let us now consider the procedure for rule encoding. An encoded rule consists of d encoded prefixes. To generate encoded prefixes, a binary trie of each field is constructed. For an established binary trie, each prefix node is visited in depth-first search order and assigned a unique identifier. Each encoded prefix is then generated by sequentially concatenating the identifiers on the path from the root to the corresponding prefix node. Each prefix in a rule is then replaced by its encoded prefix to generate the encoded rule.
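A minimal Python sketch of this encoding step (our own illustration, not the paper's implementation; TrieNode, build_trie, assign_ids and encode_prefix are assumed names) builds a per-field binary trie of prefix nodes, assigns identifiers in depth-first order, and encodes a prefix as the concatenation of the identifiers on its root-to-node path:

from string import ascii_uppercase

class TrieNode:
    def __init__(self):
        self.children = {}      # '0'/'1' -> TrieNode
        self.is_prefix = False  # True if some rule specifies this prefix
        self.ident = None       # identifier assigned to prefix nodes

def build_trie(prefixes):
    root = TrieNode()
    root.is_prefix = True       # '*' (the zero-length prefix) is always present
    for p in prefixes:
        node = root
        for bit in p.rstrip('*'):
            node = node.children.setdefault(bit, TrieNode())
        node.is_prefix = True
    return root

def assign_ids(root):
    """Visit prefix nodes in depth-first order and give each a unique letter."""
    ids = iter(ascii_uppercase)
    def dfs(node):
        if node.is_prefix:
            node.ident = next(ids)
        for bit in ('0', '1'):
            if bit in node.children:
                dfs(node.children[bit])
    dfs(root)

def encode_prefix(root, prefix):
    """Concatenate the identifiers of the prefix nodes on the path to `prefix`."""
    code, node = root.ident, root
    for bit in prefix.rstrip('*'):
        node = node.children[bit]
        if node.is_prefix:
            code += node.ident
    return code + '*'

# Field f1 of Table II: prefixes *, 0*, 000*
root = build_trie(['0*', '000*'])
assign_ids(root)                       # * -> A, 0* -> B, 000* -> C
print(encode_prefix(root, '000*'))     # 'ABC*'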

The example in Table II illustrates the procedure. The constructed binary tries are shown in Fig. 2, where each prefix node is assigned an identifier; in this example, the identifiers are denoted by capital letters. Then, the prefixes specified by the rules are replaced by the concatenated identifiers. Consider the prefix 〈000∗〉 in f1, which corresponds to node C in the leftmost binary trie in Fig. 2. The prefix 〈000∗〉 is encoded as 〈ABC∗〉 by concatenating the identifiers on the path from the root to the corresponding node. The encoded prefix is then used to replace the original prefix. The rules whose prefix fields have been replaced are listed in Table III. After encoding the rules, the number of distinct lengths in each field is equal to the number of prefix nestings. With the reduced number of distinct lengths in each field, the number of length combinations of the rules is reduced as well. In our example, the number of length combinations is further reduced from nine to eight. We list these eight length combinations with their tuple identifiers and rules in Tables III and IV.

TABLE III
ENCODED RULES FOR THE RULES IN TABLE II.

Rule  f1    f2    f3   f4   f5  Action
R0    ABC∗  ADF∗  AC∗  A∗   AB  act0
R1    ABC∗  ADF∗  AB∗  AC∗  AB  act0
R2    ABC∗  ADE∗  A∗   AC∗  AC  act1
R3    ABC∗  ADE∗  A∗   AB∗  AC  act2
R4    ABC∗  ADE∗  AC∗  AD∗  AC  act1
R5    AB∗   ADF∗  AC∗  AB∗  AB  act0
R6    AB∗   ADF∗  AC∗  AC∗  AB  act0
R7    AB∗   AD∗   A∗   A∗   AC  act2
R8    A∗    ABC∗  A∗   A∗   AC  act2
R9    A∗    AB∗   A∗   AB∗  AB  act0
R10   A∗    A∗    A∗   A∗   AB  act3
R11   A∗    A∗    A∗   A∗   AC  act4

TABLE IV
NEW TUPLE SPACE FOR THE RULES IN TABLE III.

Tuple  Length Combination  Rules
T0     (3, 3, 2, 1, 2)     R0
T1     (3, 3, 2, 2, 2)     R1, R4
T2     (3, 3, 1, 2, 2)     R2, R3
T3     (2, 3, 2, 2, 2)     R5, R6
T4     (2, 2, 1, 1, 2)     R7
T5     (1, 3, 1, 1, 2)     R8
T6     (1, 2, 1, 2, 2)     R9
T7     (1, 1, 1, 1, 2)     R10, R11

B. Encoded Prefix Expansion

Encoded prefix expansion is an active approach to changing the length combination of a prefix. The concept evolves from the prefix expansion technique presented in [39], which expands one-dimensional prefixes from their original length to a longer one. Assume prefix p is equal to 〈b_0, b_1, ..., b_{l(p)−1}〉, whose length is l(p). By expanding the length of p to a longer length l′, the set P of generated prefixes can be expressed as

P = {p_0, p_1, ..., p_{2^{l′−l(p)}−1}},   (1)

where p_i = 〈b_0, b_1, ..., b_{l(p)−1}〉 ⊕ string(i, l′ − l(p)), 0 ≤ i ≤ 2^{l′−l(p)} − 1. The operator "⊕" performs string concatenation, and string(v, l) returns the l-bit binary string of value v. The size of the set, ‖P‖, is thus equal to 2^{l′−l(p)}. For example, the prefix 〈∗〉 can be expanded to either two 1-bit prefixes (〈0∗〉 and 〈1∗〉), four 2-bit prefixes (〈00∗〉 ∼ 〈11∗〉), or eight 3-bit prefixes (〈000∗〉 ∼ 〈111∗〉).
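A small sketch of this one-dimensional expansion from Eq. (1) (ours; the helper name expand_prefix is an assumption):

def expand_prefix(p: str, target_len: int) -> list[str]:
    """Expand prefix p (e.g. '0*') to all prefixes of length target_len
    that it covers: append every (target_len - len) bit string, per Eq. (1)."""
    base = p.rstrip('*')
    extra = target_len - len(base)
    assert extra >= 0, "can only expand to an equal or longer length"
    return [base + format(i, 'b').zfill(extra) + '*' for i in range(2 ** extra)]

print(expand_prefix('*', 2))   # ['00*', '01*', '10*', '11*']
print(expand_prefix('0*', 3))  # ['000*', '001*', '010*', '011*']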

The cost of expanding a prefix greatly affects the cost of expanding a rule.


TABLE V
EXPANDED RULES FROM R2 IN TABLE II.

(000∗, 100∗, 00∗, 10∗, TCP)   (000∗, 101∗, 00∗, 10∗, TCP)
(000∗, 100∗, 01∗, 10∗, TCP)   (000∗, 101∗, 01∗, 10∗, TCP)
(000∗, 100∗, 10∗, 10∗, TCP)   (000∗, 101∗, 10∗, 10∗, TCP)
(000∗, 100∗, 11∗, 10∗, TCP)   (000∗, 101∗, 11∗, 10∗, TCP)

To expand rule r with prefix fields (p_1, p_2, ..., p_d) from its original length combination {l(p_1), l(p_2), ..., l(p_d)} to that of another tuple t, {l_1^t, l_2^t, ..., l_d^t}, the set of generated rules, R, is expressed as

R = {r_1, r_2, ..., r_{‖F_1‖×‖F_2‖×...×‖F_d‖}} = F_1 ⊗ F_2 ⊗ ... ⊗ F_d,   (2)

where F_i, 1 ≤ i ≤ d, is the set of prefixes generated on the ith field, ‖F_i‖ = 2^{l_i^t − l(p_i)}, and ⊗ is the cross-product operator. As the number of expanded rules is exponentially proportional to the number of expanded fields, the prefix expansion technique is not suitable for rules that must expand more than one field.

Let us continue with the example in Table II. To expand the length combination of R2 from {3,2,0,2,8} to {3,3,2,2,8}, 8 (= 2^0 × 2^1 × 2^2 × 2^0 × 2^0) rules are generated, as shown in Table V. The number of generated rules grows to 256 (= 2^2 × 2^2 × 2^2 × 2^2 × 2^0) for expanding R7 to {3,3,2,2,8}. If we expanded all rules to the length combination {3,3,2,2,8}, the number of rules would drastically increase to 2,048, even after removing the duplicate rules.

We found that the prefix expansion technique may produce additional prefixes that did not originally exist. Since the expanded prefixes must match the same rules, these new prefixes can be categorized into different equivalence classes, all of which have a unique set of potentially matching rules. Each original prefix has a different equivalence class, whose rules include those specifying the original prefix and its subprefixes. Therefore, the equivalence class of a new prefix can be yielded by its best-matching original prefix.

For the example in Table II, expanding the length of the prefix 〈∗〉 in f3 of R2 to two bits generates four prefixes, two of which, 〈00∗〉 and 〈11∗〉, are nonexistent in the original rules. Both prefixes have the same best-matching original prefix, 〈∗〉. In other words, they both belong to the equivalence class of f3 = 〈∗〉, which includes seven rules: R2, R3, and R7 ∼ R11. The encoding technique allows all new prefixes that belong to the same equivalence class to be represented by one encoded prefix.

We now describe the basic procedure for generating best-matching encoded prefixes for all possible new prefixes. First, each encoded prefix is denoted as ep and its length as l(ep), where the length of an encoded prefix is measured by the number of node identifiers. For a rule field, we determine the maximum length of the encoded prefixes, max_ep_length. Next, we generate an equivalence class for each encoded prefix whose length is shorter than max_ep_length. In each equivalence class, we generate (max_ep_length − l(ep)) encoded prefixes, as the length of the encoded prefix can be expanded up to max_ep_length. Each generated encoded prefix has a unique length, ranging from l(ep) + 1 to max_ep_length. For the equivalence class of a prefix ep, we further determine whether there exists at least one address whose best-matching original prefix is ep. If so, the new encoded prefixes of this equivalence class are kept for the subsequent encoded prefix expansion. Otherwise, these encoded prefixes are safely removed because no address matches them.

We embed a new data structure, termed the complementary node, for identifying each equivalence class in the binary trie. In the following procedure, we present the steps to generate the complementary node for each equivalence class in detail. As complementary nodes are not defined in the original binary trie, we add an extra pointer, called the complementary node branch, to the data structure of the original trie node. Each complementary node branch is initialized to NULL.

1) Label each prefix node with an ℓ value, which is equal to the number of prefix nodes on the path from itself to the trie root. The maximum ℓ in the binary trie is defined as max_ℓ. Note that the value ℓ is equal to l(ep), where ep is the corresponding encoded prefix.

2) Generate a complementary node for each prefix node whose ℓ value is smaller than max_ℓ. The ℓ value of the complementary node is equal to that of the corresponding prefix node plus one. Each complementary node also maintains a counter to determine whether there is any address space not covered by a longer prefix.

3) Traverse the binary trie in depth-first search order. Point each empty left or right branch of a node to the complementary node of its immediate parent prefix node, or to the associated complementary node if the node itself is a prefix node. The counter of the pointed-to complementary node is increased by one.

4) Except for the complementary nodes generated for leaf prefix nodes, eliminate the complementary nodes whose counter values are zero, because these complementary nodes are not pointed at by any branch. The complementary nodes generated for leaf prefix nodes must be kept for potential prefix expansion.

5) For each existing complementary node whose ℓ is smaller than max_ℓ, generate another complementary node which is pointed at by the current complementary node. The new complementary node is numbered (ℓ+1). Repeat this step until no complementary node can be inserted.

Fig. 3 further illustrates our procedure in a stepwise fashion. In Step 1, the ℓ values of the prefix nodes are labeled with numbers. In Step 2, complementary nodes E and F are generated for prefix nodes A and B, respectively. In the third step, the empty branches in the binary trie are directed toward the complementary nodes of their immediate parent prefix nodes, or toward the associated complementary nodes if they are prefix nodes. In Step 4, the reference count of complementary node E remains zero; thus, E is removed. In the last step, another complementary node, G, with the maximum ℓ value, is inserted as a branch of node F.

The generated complementary nodes are used to represent the nonexistent prefixes which would be generated by the original prefix expansion. In Fig. 3, the nonexistent prefixes that only match 〈∗〉 rather than the other existing prefixes are logically identical to the encoded prefix 〈AF∗〉, which matches 〈A∗〉 instead of the other existing encoded prefixes. Therefore, all nonexistent suffixes of 〈∗〉 can be encoded as 〈AF∗〉.


Fig. 3. Preprocessing Steps of Complementary Node Generation. In each diagram, the values of ℓ are represented by numbers; the assigned identifiers are represented by alphabet letters.

Input: The root of the max_ℓ-level prefix trie of field i
Output: The prefix trie with the inserted complementary nodes

INSERT(node)
    IF (node = NULL) THEN
        RETURN;
    ENDIF
    IF ((node→leaf = TRUE) AND (node→ℓ < max_ℓ)) THEN
        node→complementary = NEW_NODE();
        node→complementary→reference_count++;
        INSERT(node→complementary);                      // (Step 5)
    ELSE IF (node→prefix = TRUE) THEN
        node→complementary = NEW_NODE();                 // (Step 2)
        nearest_complementary = node→complementary;
        IF ((node→left = NULL) OR (node→right = NULL)) THEN
            node→complementary→reference_count++;        // (Step 3)
        ENDIF
        INSERT(node→left);
        INSERT(node→right);
        INSERT(node→complementary);
    ELSE IF (node→complementary_node = TRUE) THEN
        IF (node→reference_count = 0) THEN
            DELETE(node);                                // (Step 4)
        ELSE
            FOR (ℓ = node→ℓ; ℓ < max_ℓ; ℓ++)
                node→complementary = NEW_NODE();         // (Step 5)
                node = node→complementary;
            ENDFOR
        ENDIF
    ELSE
        IF (node→left = NULL) THEN                       // (Step 3)
            node→complementary = nearest_complementary;
            nearest_complementary→reference_count++;
        ENDIF
        IF (node→right = NULL) THEN                      // (Step 3)
            node→complementary = nearest_complementary;
            nearest_complementary→reference_count++;
        ENDIF
        INSERT(node→left);
        INSERT(node→right);
    ENDIF
    RETURN;
END

/* The flag leaf is TRUE if there is no other node successive to the current node. The flag
   prefix is TRUE if there is an existing prefix corresponding to the current node. The flag
   complementary_node is TRUE if the current node is a complementary node. */

Fig. 4. Pseudo Code of Complementary Node Insertion Algorithm

As the maximal ℓ value in Fig. 3 is three, another complementary node, G, is appended to node F for generating the encoded prefix 〈AFG∗〉. The pseudo code for embedding the complementary nodes is listed in Fig. 4. As each trie node is traversed once by our algorithm, the time complexity is O(NW).

After generating the complementary nodes, an encoded prefix can be expanded to a longer length. Encoded prefix expansion is defined as follows: an i-length encoded prefix can be expanded to the set of longer j-length prefixes corresponding to the successive prefix and complementary nodes whose ℓ value is j. In other words, expanding a prefix from i identifiers to j identifiers generates a set of prefixes whose count is equal to the number of all successive prefix and complementary nodes with ℓ = j.


Fig. 5. An Example of Expanding Leaf Prefixes without Generating Extra Prefixes.

For instance, in Fig. 3, prefix 〈A∗〉 can be expanded to two prefixes with two identifiers, 〈AB∗〉 and 〈AF∗〉, or to three prefixes with three identifiers, 〈ABC∗〉, 〈ABD∗〉, and 〈AFG∗〉. In the worst case, the number of generated encoded prefixes is equal to the number of all prefix nodes in the prefix trie, as each prefix node generates at most one complementary node for each larger ℓ value. Therefore, the space complexity of encoded prefix expansion is O(N). Compared with the original prefix expansion, whose space complexity is O(2^W), encoded prefix expansion usually results in far fewer expanded prefixes.
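Assuming the per-field encoded prefixes (including those contributed by complementary nodes) are available as identifier strings, encoded prefix expansion amounts to collecting the length-j codes that extend a given code. A small illustrative Python sketch (ours; expand_encoded_prefix and the code set are assumptions, with one character per identifier):

def expand_encoded_prefix(code: str, j: int, all_codes: set[str]) -> list[str]:
    """Expand an encoded prefix (e.g. 'A*') to every existing encoded prefix
    of length j (counted in identifiers) that descends from it."""
    base = code.rstrip('*')
    return sorted(c + '*' for c in all_codes
                  if len(c) == j and c.startswith(base))

# Encoded prefixes for the trie in Fig. 3 (A, B, C, D are prefix nodes;
# F, G are the surviving complementary nodes):
fig3_codes = {'A', 'AB', 'ABC', 'ABD', 'AF', 'AFG'}
print(expand_encoded_prefix('A*', 2, fig3_codes))  # ['AB*', 'AF*']
print(expand_encoded_prefix('A*', 3, fig3_codes))  # ['ABC*', 'ABD*', 'AFG*']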

For the prefixes in leaf nodes whose ℓ < max_ℓ, the extra cost of expanding such prefixes is zero. To illustrate further, we use the prefix tree in Fig. 5 as an example. There are four prefixes in the leaf nodes, namely 〈A∗〉, 〈BC∗〉, 〈BDE∗〉, and 〈BF∗〉. Except for the longest prefix, 〈BDE∗〉, the other leaf prefixes, 〈A∗〉, 〈BC∗〉, and 〈BF∗〉, can be expanded to longer prefixes without incurring any cost penalty. For instance, prefix 〈A∗〉 can be expanded to 〈AG∗〉 or 〈AGH∗〉, and prefix 〈BF∗〉 can be expanded to 〈BFK∗〉. This property saves considerable storage for a short prefix corresponding to a leaf node. We demonstrate how the required storage benefits from this property through the experiments in Section VII.

C. Expanding Encoded Rules

Encoded rule expansion begins by expanding the prefixes of each field according to the definition of encoded prefix expansion. First, an original prefix p_i (1 ≤ i ≤ d) of rule r is transformed into an encoded prefix ep_i. The encoded rule belongs to tuple s, with length combination {ℓ_1^s, ℓ_2^s, ..., ℓ_d^s}, where ℓ_i^s is the length of the ith field. To expand the rule to another tuple t, with length combination {ℓ_1^t, ℓ_2^t, ..., ℓ_d^t}, the prefix of each field is expanded to a set of encoded prefixes, EP_i, where each ep_j ∈ EP_i is an ℓ_i^t-length encoded prefix corresponding to a prefix or complementary node successive to the prefix node of p_i.


The set of the generated rules, R′, is formulated as

R′ = {r′_1, r′_2, ..., r′_{‖EP_1‖×‖EP_2‖×...×‖EP_d‖}} = EP_1 ⊗ EP_2 ⊗ ... ⊗ EP_d,   (3)

‖EP_i‖ = node_count(p_i, ℓ_i^t),   (4)

where node_count(p_i, x) is the number of prefix and complementary nodes with ℓ = x that are successive to the prefix node of p_i. With encoded prefix expansion, the maximum number of rules generated by each rule expansion is reduced from 2^{Wd} to N^d, where N is usually much smaller than 2^W.

We expand the rules in Table III to illustrate the effectiveness of the proposed scheme. The binary tries and the generated complementary nodes of all fields are shown in Fig. 6. The original prefixes of the rules are transformed into the encoded prefixes, as presented in Table III. If we expand rule R2 (= (ABC∗, ADE∗, A∗, AC∗, AC)) to the longest combination {3,3,2,2,2}, three encoded rules, (ABC∗, ADE∗, AB∗, AC∗, AC), (ABC∗, ADE∗, AC∗, AC∗, AC), and (ABC∗, ADE∗, AD∗, AC∗, AC), are generated, compared with the eight rules generated by the original prefix expansion. Moreover, the number of rules generated by expanding all rules to the length combination {3,3,2,2,2} is 507, whereas 2,048 rules are required by the original rule expansion. As a result, our scheme significantly reduces the cost of rule expansion.

V. TUPLE REDUCTION OPTIMIZATION

Using the encoded rule expansion operation described above, we are able to eliminate one length combination by expanding its rules to another length combination. In this section, we present how the number of distinct length combinations is reduced to a predefined threshold T while minimizing the number of generated rules. Assume that the maximum encoded prefix length of field i is max_ℓ_i. There are Γ = ∏_{i=1}^{d} max_ℓ_i possible length combinations of encoded rules. Hence, minimizing the cost of rule expansion is computation intensive. This problem, which is analogous to the facility location problem, can be modeled as a p-median problem [40]. In the following, we use the p-median model to define our problem. Each length combination is assigned a unique identifier from 1 to Γ. A mathematical model with one objective function and three constraints is formulated as

Minimize   Σ_{x,y ∈ Γ} ‖R′_{cx,cy}‖ × j_{cx,cy}   (5)

subject to

Σ_{x,y ∈ Γ} j_{cx,cy} ≥ (Γ − T),  cx ≠ cy,   (6)

Σ_{y ∈ Γ} j_{cx,cy} = 1,  ∀x ∈ Γ,   (7)

j_{cx,cx} ≤ j_{cy,cx},  ∀x, y ∈ Γ,   (8)

where ‖R′_{cx,cy}‖ is the number of rules generated by expanding the rules in the tuple of length combination cx to length combination cy, and j_{cx,cy} is a 0-1 variable indicating whether the rules in cx are expanded to cy.

The value of ‖R′_{cx,cy}‖ is calculated using the following equation:

‖R′_{cx,cy}‖ = Σ_{j=1}^{‖cx‖} ∏_{i=1}^{d} node_count(p_i^j, ℓ_i^{cy}),   (9)

where ‖cx‖ is the number of rules in the tuple of cx. The value of ‖R′_{cx,cx}‖ is set to ∞ to avoid expanding a tuple to itself. If any ℓ value of cx, denoted as ℓ_i^{cx} with 1 ≤ i ≤ d, is larger than ℓ_i^{cy}, then ‖R′_{cx,cy}‖ is also set to ∞. The number of all possible combinations of rule expansion is thus equal to C(Γ,1) + C(Γ,2) + ... + C(Γ,T). This model is also known as a knapsack model, which is NP-hard [40].

To simplify the calculation, we constrain our algorithm to expand rules only to the other existing length combinations. The length combinations are defined as a set £ = {c1, c2, ..., cm}, where m is the number of existing length combinations. The number of computed combinations is thus reduced to C(m,1) + C(m,2) + ... + C(m,T). Because the number of existing length combinations is usually much lower than the number of all possible length combinations, this constraint significantly reduces the computation cost.

We further develop a greedy algorithm to minimize the cost of tuple reduction. To simplify our explanation, we define a new function t(cx) that yields the length combination to which cx is expanded, which satisfies ‖R′_{cx,t(cx)}‖ ≤ ‖R′_{cx,cy}‖ for 1 ≤ y ≤ m. The greedy algorithm acts as follows; a code sketch of the loop is given after the list.

1) For every cx ∈ £, calculate t(cx).

2) Derive the length combination cx which satisfies ‖R′_{cx,t(cx)}‖ ≤ ‖R′_{cy,t(cy)}‖ for 1 ≤ y ≤ m. The tuple of cx is removed from the tuple space after expanding all of its rules to t(cx).

3) Recalculate t(t(cx)).

4) Recalculate t(cy) if t(cy) = cx, for every cy ∈ £.

5) Repeat Steps (2) ∼ (4) until ‖£‖ ≤ T or ‖R′_{cx,cy}‖ = ∞ for all cx, cy ∈ £.
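A compact Python sketch of this greedy loop (our own illustration; cost(cx, cy) stands for ‖R′_{cx,cy}‖ and is assumed to return math.inf for disallowed expansions):

import math

def greedy_tuple_reduction(combos, cost, T):
    """Reduce the set of length combinations to at most T tuples, always
    expanding the combination whose cheapest expansion target costs least."""
    remaining = set(combos)
    plan = []                                          # recorded (source, target) expansions

    def t(cx):
        """Cheapest existing target for cx, i.e. t(cx) above."""
        return min((cy for cy in remaining if cy != cx),
                   key=lambda cy: cost(cx, cy))

    while len(remaining) > T:
        cx = min(remaining, key=lambda c: cost(c, t(c)))   # Step 2
        tx = t(cx)
        if math.isinf(cost(cx, tx)):
            break                                      # no legal expansion is left
        plan.append((cx, tx))                          # expand all rules of cx into tx
        remaining.discard(cx)
        # Steps 3-4 (recomputing t(.) for affected tuples) happen implicitly,
        # because t(.) is re-evaluated over the updated 'remaining' set.
    return plan, remaining

# Toy usage with the finite costs of Table VI restricted to {T0, T1, T2, T3}:
cost_table = {('T0', 'T1'): 4, ('T2', 'T1'): 6, ('T3', 'T1'): 4}
cost = lambda a, b: cost_table.get((a, b), math.inf)
plan, remaining = greedy_tuple_reduction(['T0', 'T1', 'T2', 'T3'], cost, T=2)
# T0 and T3 (tied at cost 4) are expanded into T1; remaining == {'T1', 'T2'}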

The maximum number of calculation steps is now reduced to

m(m − 1) + (m − 1)(m − 2) + ... + (T + 1)T = Σ_{i=T}^{m−1} i(i + 1) < m³.   (10)

In practical terms, the number of calculation steps is fewer than m³, because Step 4 usually results in fewer steps than m.

The above procedure repeats until ‖£‖ ≤ T, but we can also stop the procedure by limiting the number of generated rules, i.e.,

Σ_{cx=1, cy=1}^{cx,cy < Γ} ‖R′_{cx,cy}‖ × j_{cx,cy} ≤ N × e_factor,   (11)

where e_factor is a predefined expansion factor. By setting the expansion factor to two, the expansion procedure is limited to producing at most N replicated rules after completing tuple reduction.


Fig. 6. Binary Tries with Complementary Nodes from Fig. 2 for Encoded Rule Expansion.

While expanding all rules to one length combination, each rule leads to N^d rules in the worst case, which results in a total of N^{d+1} rules. However, redundant rules can be excluded from this calculation. According to the procedure for generating complementary nodes, there are at most N distinct prefixes for a given ℓ value in each field. Therefore, there are at most N^d distinct rules in the last tuple, and the space complexity is O(N^d). In the following theorem, we derive a generalized upper bound on the number of generated rules when expanding all rules into T tuples.

Theorem 1: Given a rule database with N rules, the required storage for expanding N rules from m to T tuples is less than T × N^{d×T^{−1/d}}, where m ≥ T.

Proof: To reduce the number of tuples from m to T, we categorize the m tuples into T groups, where each tuple is expanded to another tuple in the same group, without loss of generality. After (m − T) iterations of tuple reduction, only one tuple is left in each group. According to the principle of the proposed greedy algorithm, the worst case occurs when the number of generated rules in each group is the same. Therefore, we measure the number of generated rules in each group by assuming that the T groups are evenly distributed in the tuple space, so that there are T^{1/d} groups on each dimension of the tuple space.

Based on the upper bound of tuple reduction, N^d, the maximum expansion factor of each dimension is N. Therefore, the one-dimensional expansion factor α within each group can be measured by the equation α^{T^{1/d}} = N. Accordingly, α is equal to N^{T^{−1/d}}. Using the value of α, we can derive the overall expansion cost as

T × α^d = T × (N^{T^{−1/d}})^d = T × N^{d×T^{−1/d}}.   (12)

We illustrate the procedure of our greedy algorithm by reducing the tuple space in Table III from eight tuples to four (i.e., T = 4). We show an all-pair matrix of tuple expansion costs in Table VI. As mentioned above, a tuple cannot be expanded to another tuple with at least one shorter prefix length, or to itself; thus, more than half of the costs can be set to ∞ immediately. For the tuple T0, we only have to calculate the expansion cost to T1. As T0 and T1 have different f4 lengths, we calculate the number of successive prefix and complementary nodes whose ℓ values are two for R0.

TABLE VI
VALUES OF ‖R′_{cx,cy}‖, ∀cx, cy ∈ £.

cx \ cy   T0   T1    T2    T3    T4   T5   T6   T7
T0        ∞    4     ∞     ∞     ∞    ∞    ∞    ∞
T1        ∞    ∞     ∞     ∞     ∞    ∞    ∞    ∞
T2        ∞    6     ∞     ∞     ∞    ∞    ∞    ∞
T3        ∞    4     ∞     ∞     ∞    ∞    ∞    ∞
T4        18   72    24    36    ∞    ∞    ∞    ∞
T5        9    36    12    24    ∞    ∞    ∞    ∞
T6        ∞    18    6     12    ∞    ∞    ∞    ∞
T7        90   360   120   240   16   10   32   ∞

In Fig. 6, there are three prefix nodes and one complementary node with ℓ = 2 in the prefix trie of f4. Since the prefix expansion cost for f4 is four and those for the other fields are one, the overall rule expansion cost is four. We further consider the expansion cost from T7 to T1. The expansion cost for each rule in T7 is 180 (= 3 × 5 × 3 × 4 × 1), and the overall cost is 360. Using the cost matrix, we further calculate t(cx) and ‖R′_{cx,t(cx)}‖ for each length combination cx, as listed in Table VII(A). From Table VII(A), we select the tuple with the least expansion cost. Both T0 and T3 have the least expansion cost. We randomly choose T0 and expand R0 to the length combination of T1. The fourth field of rule R0 is expanded to four longer prefixes, namely AE∗, AB∗, AC∗, and AD∗. Hence, R0 is expanded to four rules. After removing tuple T0, we update the values for T4 and T5 according to Step (4) of our algorithm. The values for T2 are not updated, as it has the longest length combination. In the next two iterations, the rules in T3 and T2 are expanded. In the last iteration, two rules in T7 are expanded to the length combination of T5, according to Table VII(D). The size of the tuple space is reduced to four tuples, and the procedure stops.

We show the content of each tuple in Table VIII, where the expanded prefixes of each expanded rule are represented as a set of longer prefixes separated by slashes. The rules are arranged according to their order of insertion into a new tuple. The rules listed in parentheses are the redundant rules. For instance, rule (ABC∗, ADF∗, AC∗, AB∗, AB) is shared by R0 and R5, and (A∗, ABC∗, A∗, A∗, AC) is shared by R8 and R11. The redundant rules are merged to save storage. Finally, there are 26 rules in the optimized tuple space.

VI. HASH-BASED PACKET CLASSIFICATION

In this section, we first present the search procedure based on the hash tables generated by the proposed algorithm.


TABLE VII
CALCULATION FOR OPTIMIZED ENCODED RULE EXPANSION.

(A) Initialization
cx   t(cx)   ‖R′_{cx,t(cx)}‖
T0   T1      4
T1   NA      NA
T2   T1      6
T3   T1      4
T4   T0      18
T5   T0      9
T6   T2      6
T7   T5      10

(B) After expanding T0 to T1
cx   t(cx)   ‖R′_{cx,t(cx)}‖
T1   NA      NA
T2   T1      6
T3   T1      4
T4   T2      24
T5   T2      12
T6   T2      6
T7   T5      10

(C) After expanding T3 to T1
cx   t(cx)   ‖R′_{cx,t(cx)}‖
T1   NA      NA
T2   T1      6
T4   T2      24
T5   T2      12
T6   T2      6
T7   T5      10

(D) After expanding T2 to T1
cx   t(cx)   ‖R′_{cx,t(cx)}‖
T1   NA      NA
T4   T1      72
T5   T1      36
T6   T1      18
T7   T5      10

(E) After expanding T7 to T5
cx   t(cx)   ‖R′_{cx,t(cx)}‖
T1   NA      NA
T4   T1      72
T5   T1      36
T6   T1      18

Since the proposed algorithm uses a database-dependent encoding method, an efficient rule-update procedure is particularly important. The second part of this section describes the update procedures for rule insertion and deletion.

A. Search Procedure

After performing tuple reduction, all encoded rules with the same length combination are stored in one hash table, where the hash keys are generated from the concatenated identifiers of all fields. The set of all hash tables forms a tuple space. To search for encoded rules in the tuple space, d one-dimensional lookups are first performed to transform each inspected field of an incoming packet header into its best matching prefix (BMP). Then, we replace each original header value with the encoding of its BMP to generate an encoded packet header. Next, we extract the corresponding identifiers of the encoded packet header and concatenate them to generate the hash index for the corresponding hash table. With a linear tuple search, all hash tables are accessed sequentially to obtain all matching rules.

Let us consider an example to illustrate the search procedure. Assume that the header of the incoming packet, (000, 111, 10, 01, UDP), is extracted. For each field, the corresponding binary trie in Fig. 2 is traversed to yield the encoded BMP. From the encoded BMPs, we generate an encoded packet header, (ABC∗, ADF∗, AC∗, AB∗, AB). For each tuple in Table VIII, a hash index is generated by concatenating the corresponding parts of the matching prefixes. For example, the hash index of (ABC∗, ADF∗, AC∗, AB∗, AB) is used to probe tuple T1, and another hash index, (AB∗, AD∗, A∗, A∗, AB), is used to probe T4. After probing all tuples, three successful matches are returned for the encoded rules of R0, R5 and R10, which belong to the tuples T1 and T5, respectively.

The software architecture of a packet forwarding engine based on our algorithm consists of d one-dimensional data structures and T hash tables. In this work, we adopt the algorithm of binary search on prefixes [41] for our implementation. After performing the one-dimensional searches, the encoded BMPs are obtained to generate T hash indices for hash accesses. Subsequently, T hash accesses are carried out. The time complexity is thus equal to O(d log₆ N + T).
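A high-level Python sketch of the linear tuple search (ours; the encoded BMPs, the toy hash buckets and the verify callback are assumptions, with hash tables modeled as dicts keyed by concatenated identifiers):

def tuple_space_search(encoded_bmps, tuples, verify):
    """encoded_bmps: the d per-field encoded BMPs of a packet, e.g. ['ABC', 'ADF', 'AC', 'AB', 'AB'];
    tuples: list of (length_combination, hash_table) pairs, each hash_table a dict
            mapping a concatenated identifier key to a list of stored rules;
    verify: callback that compares the packet against a stored original rule."""
    matches = []
    for lengths, table in tuples:
        # Truncate each encoded BMP to the tuple's per-field length (identifiers
        # are one character each here) and concatenate to form the hash key.
        key = ''.join(code[:l] for code, l in zip(encoded_bmps, lengths))
        for rule in table.get(key, []):
            if verify(rule):          # compare with the original rule to rule out false matches
                matches.append(rule)
    return matches

# Probing T1 = (3,3,2,2,2) and T4 = (2,2,1,1,2) for the encoded header of the
# example packet (000, 111, 10, 01, UDP); the buckets are toy stand-ins:
header = ['ABC', 'ADF', 'AC', 'AB', 'AB']
t1 = {'ABCADFACABAB': ['R0/R5']}
t4 = {'ABADAAAC': ['R7']}
print(tuple_space_search(header, [((3, 3, 2, 2, 2), t1), ((2, 2, 1, 1, 2), t4)],
                         verify=lambda r: True))
# ['R0/R5']  (the T4 key 'ABADAAAB' differs from the stored 'ABADAAAC', so no match there)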

An effective improvement to the implementation is to adopt the algorithm of PTSS [32]. In each prefix node, a T-bit bitmap is stored, where each bit corresponds to a hash table. For a node of prefix p, a bit is set to one if the corresponding hash table contains at least one rule whose field specification is p or a subprefix of p.

TABLE VIII
OPTIMIZED TUPLE SPACE.

Tuple  Rule Specifications: (f1, f2, f3, f4, f5)            Matching Rules
T1     (ABC∗, ADF∗, AB∗, AC∗, AB)                           R1
       (ABC∗, ADE∗, AC∗, AD∗, AC)                           R4
       (ABC∗, ADF∗, AC∗, AB∗/AC∗/AD∗/AE∗, AB)               R0, (R5, R6)
       (ABC∗/ABD∗, ADF∗, AC∗, AB∗, AB)                      R5
       (ABC∗/ABD∗, ADF∗, AC∗, AC∗, AB)                      R6
       (ABC∗, ADE∗, AB∗/AC∗/AD∗, AC∗, AC)                   R2
       (ABC∗, ADE∗, AB∗/AC∗/AD∗, AB∗, AC)                   R3
T4     (AB∗, AD∗, A∗, A∗, AC)                               R7
T5     (A∗, ABC∗, A∗, A∗, AC)                               R8, (R11)
       (A∗, ABG∗/ABC∗/ADE∗/ADH∗/ADF∗, A∗, A∗, AB)           R10
       (A∗, ABG∗/ABC∗/ADE∗/ADH∗/ADF∗, A∗, A∗, AC)           R11
T6     (A∗, AB∗, A∗, AB∗, AB)                               R9

While performing the one-dimensional searches, d bitmaps are retrieved. By intersecting all bitmaps, a bitmap of the hash tables with possibly matching rules is obtained, and only those hash tables are accessed.
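A sketch of this pruning step, assuming each per-field lookup returns its T-bit bitmap as a Python integer (the probe loop itself is unchanged):

def pruned_tuple_probe(field_bitmaps, num_tables):
    """Intersect the d per-field bitmaps (one bit per hash table) and
    return the indices of the hash tables that still need to be probed."""
    candidates = ~0                       # start with all tables as candidates
    for bm in field_bitmaps:
        candidates &= bm                  # a table survives only if every field allows it
    return [i for i in range(num_tables) if (candidates >> i) & 1]

# Example with T = 4 hash tables: field 1 allows tables {0, 2, 3},
# field 2 allows {0, 1, 2}; only tables 0 and 2 are probed.
print(pruned_tuple_probe([0b1101, 0b0111], num_tables=4))  # [0, 2]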

B. Update Procedure

Our scheme supports dynamic packet classification by using the proposed update procedures. We consider the update procedure for rule insertion first.

In the case of rule insertion, a new rule might specify one or more new prefixes that did not originally exist. When such a new prefix is inserted into the corresponding binary trie, it can change the prefix nesting of the original prefixes. As a result, the length combinations of the existing rules could be affected, and a tuple reallocation of these rules would be required. For example, suppose a new rule (000∗, 1∗, 01∗, 1∗, TCP) is inserted into Table II. The prefix 〈1∗〉 in field f4 corresponds to a new prefix node in the binary trie of f4 in Fig. 6. Assume that the identifier of the new prefix node is F. The encoded prefixes of 〈10∗〉 and 〈11∗〉 would then change to 〈AFC∗〉 and 〈AFD∗〉, respectively. The length combinations of the encoded rules whose f4 contains 〈AC∗〉 or 〈AD∗〉 must be changed as well. In the worst case, all rules would be given new length combinations, and a reconstruction of the whole data structure would be necessary.

To cope with this problem, we propose two heuristic approaches to rule insertion, based on the same concept used in the proposed range-to-prefix conversion of Section IV-A. In the first approach, we replace each new prefix of the inserted rule with its longest existing enclosure prefix. After replacing the new prefix, the inserted rule can be encoded without changing the prefix nesting.


In the previous example, the new prefix 〈1∗〉 of f4 is represented by the encoded prefix of 〈∗〉, denoted as 〈A∗〉, and the new rule is encoded as (ABC∗, AD∗, AB∗, A∗, AC). If the length combination of the new rule is nonexistent, the new rule is expanded to an existing length combination with the minimum expansion cost. Similarly, the original rule is stored in the corresponding bucket for further comparison.

Consider the case where the expansion cost for a new rule with a new length combination is high. A second approach is used to further encode the new rule as an enclosure rule with an existing length combination. An enclosure rule of rule R is defined as follows: each of its fields is either an enclosure prefix of, or equal to, the corresponding field in R. The inserted rule is first transformed into an encoded rule, as in the first update approach. Next, all possible enclosure rules are generated by directly revising the length combination of the inserted rule. Among these enclosure rules, either the longest one or the one with the fewest hash collisions is used to represent the new rule. The inserted rule in the previous example is transformed to the encoded rule (ABC∗, AD∗, AB∗, A∗, AC), whose length combination is {3,2,2,1,2}. Since this length combination is nonexistent in Table VIII, one enclosure rule, (AB∗, AD∗, A∗, A∗, AC), is generated and inserted into T4. As rule expansion is no longer needed, each rule insertion is accomplished with one hash-table insertion.

The update procedure for rule deletion could be achieved by simply tracking the length combinations to which the original rule was expanded and updating all rules generated from the updated rule. However, the update cost would then be tied to the variable number of expanded rules. We therefore use the rule insertion procedure to provide a constant update cost by inserting an updated rule. For rule deletion, a shadow rule, which contains the same field specifications as the rule to be deleted, is inserted into the tuple space to indicate the deletion event. When a packet matches the deleted rule, the shadow rule is also matched; it reports the deletion event so that the match to the deleted rule is ignored.

VII. PERFORMANCE EVALUATION

In this section, we describe how encoded rule expansion performs on both real and synthetic databases in terms of speed and storage. The performance metrics inspected include the number of hash tables and hash collisions, memory requirements, and the number of memory accesses. The memory requirements take all necessary data structures into account, including the multiway search trees for one-dimensional searches and the hash tables. The number of memory accesses likewise includes the accesses to the multiway search trees and the hash tables. The accesses to hash tables include the extra accesses caused by hash collisions, and each memory access retrieves 22 bytes³.

The performance evaluation consists of three parts. First, we test the performance of generic packet classification by using databases from http://www.arl.wustl.edu/∼hs1/PClassEval.html [42]. There are three types of rule databases, namely access control list (acl), firewall (fw), and IP chain (ipc).

³In [5], the same setting is used so that a decision node can be fetched in one memory access.

We also compare our scheme with EffiCuts [5] by using large 100K-rule databases. Second, we consider the packet classifiers with VLAN tags. Lastly, we evaluate the performance for large rule databases with network virtualization. While performing tuple reduction, we limit the total number of rules to twice the number of original rules by setting e_factor = 2.

A. Performance for Generic Rules

We first show the combined effects of our range-to-prefix conversion approach and rule encoding on the number of hash tables using the real rule databases in [42]. As shown in Table IX, the length combinations of the original databases are significantly reduced by the proposed range conversion approach, and a further reduction is achieved by rule encoding. In contrast, the basic approach to range-to-prefix conversion usually results in many replicated rules, especially for ACL1_752 and FW1_269. Therefore, by adopting the proposed approach and rule encoding, both the storage requirement and the number of hash tables are reduced simultaneously. The number of hash tables can be further reduced by employing our encoded rule expansion scheme. After eliminating more than 90% of the hash tables, our scheme still has better storage performance for ACL1_752 and FW1_269. We also show the numbers of hash accesses generated by using PTSS in Table IX. While the number of accessed hash tables is proportional to the number of total hash tables, the access ratios are quite diverse for different databases. Although PTSS helps to decrease hash accesses, it cannot guarantee the worst-case performance. Our scheme, however, limits the number of hash accesses.

A similar result can be derived from another experiment on the synthetic 10K-rule databases, as shown in Table X. It is worth noting that although rule encoding can eliminate hash tables, it may increase the number of hash accesses (IPC1_10K). Since rule encoding changes the rules stored in a hash table, the changed correlation between individual field specifications may cause PTSS to perform worse. The result suggests that only reducing the number of hash tables to a small fraction of the original can ensure better search performance.

Fig. 7 further illustrates the effects of our greedy scheme on the 10K-rule databases, where the number of generated rules in each iteration of tuple reduction is recorded. The expansion cost of the first several iterations can be as low as zero due to the absence of longer prefixes for the field values of the expanded rules, as mentioned above. As there is only one complementary node successive to each prefix node of these fields, only one rule is generated for each rule expansion. In other words, rule expansion produces no replicated rules in these iterations. When the number of tuples is large, the number of generated rules in each iteration is usually small (e.g., IPC1_10K). After the number of hash tables has been reduced, a sparser tuple space may result in a sharply increasing expansion cost. Since our scheme is designed for large rule databases, it can attenuate the dense tuple space caused by a large number of rules with a low storage penalty.

Next, we test the performance differences caused by different expansion factors (e_factor). In Table XI, we show the memory space, including all data structures, and the number of memory accesses needed to accomplish one packet classification.


TABLE IX
EFFECTS OF OUR APPROACHES ON REAL RULE DATABASES. (e_factor = 2 FOR ENCODED RULE EXPANSION)

Rule        Basic Range Conversion     New Range Conversion      Rule Encoding             Encoded Rule Expansion
Database    Rules   H.T.  H. ACC.      Rules   H.T.  H. ACC.     Rules   H.T.  H. ACC.     Rules   H.T.  H. ACC.
ACL1_752    1,834    79     19           752    46     11          752    20      9        1,170     7      4
FW1_269       914   221    109           269    34      7          269    32      3          466    14      4
IPC1_1550   2,180   244     33         1,550   177     14        1,550    92     11        2,922    21      8

H.T.: Hash Tables; H. ACC.: Maximum Hash Accesses

TABLE X
EFFECTS OF OUR APPROACHES ON SYNTHETIC 10K-RULE DATABASES. (e_factor = 2 FOR ENCODED RULE EXPANSION)

Rule        Basic Range Representation   New Range Representation   Rule Encoding             Encoded Rule Expansion
Database    Rules    H.T.  H. ACC.       Rules   H.T.  H. ACC.      Rules   H.T.  H. ACC.     Rules    H.T.  H. ACC.
ACL1_10K    12,947   109     25          9,603    79     13         9,603    51     12        17,398     7      7
FW1_10K     32,136   236     64          9,311    49      6         9,311    24      4        15,169    10      4
IPC1_10K    12,127   299     22          9,037   219     13         9,037   139     19        17,248    19     10

TABLE XI
EFFECTS OF DIFFERENT EXPANSION FACTORS ON SYNTHETIC 10K-RULE DATABASES.

Rule        1-D    1-D      e_factor = 1               e_factor = 2               e_factor = 4               e_factor = 8
Database    ACC.   Space    Space  Collisions  ACC.    Space  Collisions  ACC.    Space  Collisions  ACC.    Space  Collisions  ACC.
ACL1_10K     16    1,709      825      6        42       907      6        26     1,068      6        27     1,348      8        27
FW1_10K       7      985      873      7        17       931      7        13     1,131      8        13     1,580      9        13
IPC1_10K     17    1,186      793      6        53       840      6        28     1,075      7        30     1,418      7        29

1-D ACC.: Maximum Memory Accesses for One-dimensional Searches; Space: Total Memory Space in KB;
Collisions: Maximum Hash Collisions; ACC.: Maximum Memory Accesses for One Packet Classification

Fig. 7. Number of Replicated Rules in Each Iteration of Tuple Reduction.

The maximum numbers of hash collisions in all hash tables are also reported. There is a tradeoff between storage and speed performance, but the speed improvement becomes negligible when e_factor > 2. A larger e_factor also results in more hash collisions. Thus, we fix the value of e_factor to two in the following experiments.

Next, we use another two sets of rule databases, 1K and 5K, in [42] to show both the storage and speed performance of our scheme. As shown in Table XII, fivefold rules only result in moderate increases in memory space. It is also worth noting that about half or more of the memory accesses are taken by one-dimensional searches. Thus, the performance of hash-based packet classification can be improved by using faster one-dimensional searches.

To test the update performance of our scheme, we randomly generate 10K new rules for each 10K-rule database in [42]. The new rules are inserted into the hash tables after performing tuple reduction.

Fig. 8. Number of Rules with Newly Inserted Rules.

For each new rule, the number of inserted rules and the collisions caused by the insertion are recorded, as shown in Figs. 8 and 9. Since ACL1_10K has the fewest hash tables, its update cost is the highest because the length combinations of the new rules differ more from those of the existing hash tables. A higher update cost also raises hash collisions, as shown in Fig. 9. The hash collisions for ACL1_10K can be as high as nine, but those for FW1_10K and IPC1_10K are usually less than seven. FW1_10K has the lowest update cost since most new rules can be inserted by using their enclosure rules.

Lastly, we compare the scalability of our scheme with EffiCuts based on three 100K-rule databases generated by ClassBench [43]⁴. As shown in Table XIII, our scheme has storage requirements comparable to those of EffiCuts; both EffiCuts and our scheme can control the number of replicated rules.

⁴The results of EffiCuts are extracted from [5]. More results are available in [44]; we do not include them due to limited page space.


TABLE XII
PERFORMANCE FOR SYNTHETIC 1K AND 5K RULE DATABASES.

Rule        1K Databases                                       5K Databases
Database    Rules   H.T.  Collisions  Space (KB)  1-D ACC.  ACC.    Rules   H.T.  Collisions  Space (KB)  1-D ACC.  ACC.
ACL1        1,788     3       6          552         16      24     7,764     6       6          691         16      32
FW1         1,531     6       6          546         16      27     6,697     6       7          696         11      18
IPC1        1,798    11       5          551         15      29     8,503    12       7          694         16      37

Fig. 9. Number of Collisions for the Newly Inserted Rules.

TABLE XIII
PERFORMANCE COMPARISONS WITH EFFICUTS.

Rule          EffiCuts [5]      Encoded Rule Expansion
Database      Space   ACC.      Rules     H.T.  Collisions  Space  ACC.
ACL3_100K      5.3     83       166,829    14       7        5.4    23
FW3_100K       3.7     53       138,384     9       7        4.2    28
IPC2_100K      5.5     17       166,121     5       8        4.6    14

Space: Total Memory Space in MB

The main storage overhead for EffiCuts comes from the data structures, including pointers and node information, for maintaining the decision trees. Our scheme uses hash tables to store rules, which minimizes the storage for maintaining rules. The main storage overhead of our scheme comes from the rule replications for reducing hash tables and from the one-dimensional data structures. For speed performance, our scheme outperforms EffiCuts. EffiCuts uses multiple decision trees (5-6 trees, as reported in [5]) to reduce the storage requirement. For each decision tree, a linear search over all rules in a leaf bucket (up to 16 rules) is performed. Our scheme uses hash functions to significantly reduce the number of comparisons.
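As a rough sketch of the lookup loop (under the same illustrative data layout used in the earlier sketches), each remaining hash table is probed exactly once, so the number of key comparisons is bounded by the number of hash tables rather than the number of rules; the per-field one-dimensional searches that map header values to encoded prefixes are omitted for brevity.

    # Sketch of the classification loop over the reduced tuple space.
    def classify(encoded_packet_fields, tables):
        matched = []
        for combo, table in tables.items():
            # Hash key: each encoded packet field truncated to the table's
            # length combination.
            key = tuple(f[:l] for l, f in zip(combo, encoded_packet_fields))
            # A bucket stores the original rules, so the exact match (and the
            # shadow filtering sketched earlier) is verified against them.
            matched.extend(table.get(key, []))
        return matched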

B. Performance for Rules with VLAN Identifiers

In the following experiments, we inject VIDs into the rules to show the performance for Layer-2 network virtualization. We use the 100K-rule databases in these experiments. We first consider the scenario with only single VIDs to simulate an enterprise (or single-tenant) datacenter. Because some rules could be applied to all packets in a datacenter network, we also generate wildcards for the VIDs with a percentage ranging from 0 to 90. The tuple space for the rules with one VID has six dimensions, while the rules with two VIDs result in a seven-dimensional tuple space. We do not perform any specific treatment for the extra VID fields.

TABLE XIV
PERFORMANCE FOR 100K-RULE DATABASES WITH ONE RANDOMLY INJECTED VID.

ACL3
W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
 0     1,974    125     19         8          5.9     22
10     3,050    194     25         8          6.0     21
30     3,342    204     24         8          6.0     20
50     3,428    207     26         8          6.0     20
70     3,373    207     28         8          5.9     19
90     3,071    191     25         8          6.0     21

FW3
W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
 0     1,248     46     11         7          4.7     25
10     1,968     67     19         9          4.7     27
30     2,082     75     20        10          4.7     26
50     2,052     73     23         8          4.7     28
70     2,161     73     21         8          4.8     28
90     1,995     66     16         8          4.7     26

IPC2
W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
 0        77     25      6         7          5.0     16
10       112     28     10         9          5.1     16
30       115     31     10         8          5.2     16
50       115     32     10         8          5.1     15
70       115     30     10         7          5.1     16
90       114     29      9        10          5.3     16

W.C.: Wildcard Percentage of VID; H.T.1: H.T. for Original Rules; H.T.2: H.T. for Encoded Rules;
H.T.3: H.T. for Expanded Rules; Space: Total Memory Space in MB

Fig. 10. Number of Replicated Rules in Each Iteration of Tuple Reduction.

As shown in Table XIV, the extra VID field significantly increases the number of length combinations. Both rule encoding and expansion can effectively reduce the number of hash tables. The number of hash tables after performing rule expansion is about twice that in Table XIII in order to maintain speed performance. The extra VID field affects storage performance because of the extra hash tables; however, rule replications do not increase significantly. We also show the expansion cost of each tuple reduction for the databases with 50% VID wildcards in Fig. 10.


TABLE XV
PERFORMANCE FOR 100K-RULE DATABASES WITH TWO RANDOMLY INJECTED VIDS.

ACL3
W.C.×W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
10×50       5,192    324     45         9          5.8     23
50×10       5,195    323     39         9          5.8     19
50×50       5,874    350     48         8          5.8     21
50×90       5,217    317     42         7          5.8     21
90×50       5,230    326     37         8          5.8     20

FW3
W.C.×W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
10×50       3,435    109     29         9          4.8     35
50×10       3,399    113     33         7          4.8     34
50×50       3,549    120     38         8          4.8     36
50×90       3,368    109     33         8          4.8     40
90×50       3,512    109     30         8          4.8     37

IPC2
W.C.×W.C.   H.T.1   H.T.2   H.T.3   Collisions   Space   ACC.
10×50         192     36     16         8          5.2     16
50×10         192     36     16         8          5.2     16
50×50         192     36     18         8          5.2     15
50×90         192     36     16         8          5.2     15
90×50         192     36     16         7          5.2     16

W.C.×W.C.: Wildcard Percentages of Both VIDs; Space: Total Memory Space in MB

A datacenter network can allow each customer to deploy their own VLANs. To properly fulfill access control for these VLANs, dual VIDs can be used [7]. We randomly inject two VIDs into all rules of the 100K databases. Similarly, we also consider the percentages of wildcards. There are two possible wildcard combinations: both VIDs, and the second VID only. The former is specified for rules which apply to all traffic in a datacenter network, and the latter for rules which apply only to a specific VLAN/customer. Table XV shows that dual VIDs further increase the number of original length combinations. Although the speed performance of FW3 is degraded because of the extra hash tables, both the speed and storage performance of ACL3 and IPC2 are not affected.

C. Performance for Rules Enabling Network Virtualization

In the last set of experiments, we use ClassBench to randomly generate different types of rule databases. Three sets of customer rule databases are generated. The first set consists of one thousand databases, each with one thousand rules. The second set consists of one hundred databases, each with ten thousand rules. The third set consists of ten databases, each with one hundred thousand rules. The VIDs of the rules in all sets are randomly generated, and ten percent of the second VIDs are further set to wildcards. For each set, all databases are merged to simulate an extremely complex environment in a datacenter network. For each set, we further append ten databases of different types in which both VID fields are wildcards.

Table XVI shows that the set with small customer rule databases (1,000 rules each) consumes the least memory, since its rules have higher diversity, which reduces the cost of rule expansion. In contrast, the rules in the large customer rule databases (100,000 rules each) may correlate with each other and increase rule replications. However, the number of memory accesses does not increase severely for any of the sets. These results show that although the storage performance of our scheme is related to the number of rules, the speed performance is not. This advantage makes the scheme a scalable solution for a large and complex datacenter network.

TABLE XVI
PERFORMANCE FOR COMPLEX NETWORK VIRTUALIZATION

#VIDs × #Rules   Rules       H.T.1    H.T.2   H.T.3   Collisions   Space   ACC.
1000 × 1000      1,006,637   18,867   8,841     70        8         48.9    39
100 × 10000      1,056,418   18,677   8,189     65        8         53.2    44
10 × 100000      1,078,141   17,687   6,454     73        7         58.6    37

Space: Total Memory Space in MB

A TCAM implementation for any of the databases in Table XVI would consume 40-50 MB, which is impractical nowadays. To support these databases, a software or software-hardware hybrid implementation is necessary. Our scheme provides a feasible software implementation to enable a scalable datacenter network. It can achieve about 20 Gbps average throughput by using 8 ns reduced-latency DRAM [45] with the average packet size of 850 bytes reported in [46]. With hardware parallelism, the speed performance can be further improved, for example, by using TCAM for the one-dimensional searches or Bloom filters for reducing both hash collisions [47] and accessed hash tables [48].
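The quoted figure is consistent with a back-of-the-envelope estimate; assuming on the order of 40 memory accesses per packet (the ACC. column of Table XVI) at 8 ns each and the 850-byte average packet size:

    \text{throughput} \approx \frac{850 \times 8\ \text{bits}}{40 \times 8\ \text{ns}} \approx 21\ \text{Gbit/s}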

VIII. CONCLUSIONS

A datacenter network has been getting more complicated due to its increasing number of applications. To reduce the cost of maintaining a datacenter network, both virtualization and consolidation are necessary. In this paper, we proposed a set of algorithms for scalable hash-based packet classification based on the essence of tuple space search. Our algorithms aim at reducing the size of the tuple space to achieve faster search speed. The main algorithm, encoded rule expansion, could significantly reduce the cost of rule expansion with the aid of a new data structure, the complementary node. The proposed approach to range-to-prefix conversion minimizes the storage penalty based on the property of hash accesses. A greedy algorithm for optimizing tuple reduction is presented to minimize the required storage of the expanded rules. The proposed update approach can minimize the update cost for rule insertion to assure the accuracy of packet classification. The proposed scheme provides better leverage between time and space complexity as compared with the known theoretical worst cases of O(log N) query time or O(N^d) storage. In addition, the adjustable parameter for the number of hash tables offers better flexibility to real applications. In the experiments, we demonstrated the scalability of the proposed scheme in terms of speed and space, with respect to rule databases of varying sizes and characteristics. Our results show that each packet classification takes less than 30 memory accesses in the worst case for small rule databases, including the memory accesses for one-dimensional searches and hash collisions. In the scalability test, the number of memory accesses is less than 40 for large databases with one million rules. The results show that our scheme is a scalable and efficient solution for a datacenter network.

REFERENCES

[1] D. A. Joseph, A. Tavakoli, and I. Stoica, "A policy-aware switching layer for data centers," in Proc. ACM SIGCOMM'2008, pp. 51–62.

[2] D. E. Taylor, "Survey and taxonomy of packet classification techniques," ACM Computing Surveys, vol. 37, no. 3, pp. 238–275, 2005.


[3] F. Yu, R. H. Katz, and T. Lakshman, "Efficient multimatch packet classification and lookup with TCAM," IEEE Micro, vol. 25, no. 1, pp. 50–59, 2005.

[4] Cisco 4700 Series Application Control Engine Appliance Device Manager GUI Configuration Guide, Cisco, 2010. [Online]. Available: http://www.cisco.com/en/US/docs/app ntwk services/data center appservices/ace appliances/vA4 1 0/configuration/device manager/guide/dmgui cfgd.pdf

[5] B. Vamanan, G. Voskuilen, and T. N. Vijaykumar, "EffiCuts: Optimizing packet classification for memory and throughput," SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 207–218, 2010.

[6] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary, "Algorithms for advanced packet classification with ternary CAMs," in Proc. ACM SIGCOMM'05.

[7] Network Virtualization - Path Isolation Design Guide, Cisco, 2009. [Online]. Available: http://www.cisco.com/en/US/docs/solutions/Enterprise/Network Virtualization/PathIsol.pdf

[8] T. Mishra and S. Sahni, "PETCAM - A power efficient TCAM architecture for forwarding tables," IEEE Trans. Comput., vol. 61, no. 1, pp. 3–17, 2012.

[9] W. Lu and S. Sahni, "Low-power TCAMs for very large forwarding tables," IEEE/ACM Trans. Netw., vol. 18, no. 3, pp. 948–959, 2010.

[10] A. Kesselman, K. Kogan, S. Nemzer, and M. Segal, "Space and speed tradeoffs in TCAM hierarchical packet classification," in Proc. IEEE Sarnoff'2008, pp. 1–6.

[11] K. Zheng, C. Hao, Z. Wang, B. Liu, and X. Zhang, "DPPC-RE: TCAM-based distributed parallel packet classification with range encoding," IEEE Trans. Comput., vol. 55, no. 8, pp. 947–961, 2006.

[12] H. Che, Z. Wang, K. Zheng, and B. Liu, "DRES: Dynamic range encoding scheme for TCAM coprocessors," IEEE Trans. Comput., vol. 57, no. 7, pp. 902–915, 2008.

[13] Y.-K. Chang, C.-I. Lee, and C.-C. Su, "Multi-field range encoding for packet classification in TCAM," in Proc. IEEE INFOCOM'2011, pp. 196–200.

[14] X. He, J. Peddersen, and S. Parameswaran, "LOP RE: Range encoding for low power packet classification," in Proc. IEEE LCN'2009, pp. 137–144.

[15] A. Bremler-Barr and D. Hendler, "Space-efficient TCAM-based classification using Gray coding," in Proc. IEEE INFOCOM'2007, pp. 1388–1396.

[16] A. Bremler-Barr, D. Hay, and D. Hendler, "Layered interval codes for TCAM-based classification," in Proc. IEEE INFOCOM'2009, pp. 1305–1313.

[17] R. Wei, X. Yang, and H. J. Chao, "Block permutations in Boolean space to minimize TCAM for packet classification," in Proc. IEEE INFOCOM'2012, pp. 2561–2565.

[18] C. R. Meiners, A. X. Liu, and E. Torng, "Bit weaving: A non-prefix approach to compressing packet classifiers in TCAMs," in Proc. IEEE ICNP'2009, pp. 93–102.

[19] R. Cohen and D. Raz, "Simple efficient TCAM based range classification," in Proc. IEEE INFOCOM'2010, pp. 1–5.

[20] C. R. Meiners, A. X. Liu, E. Torng, and J. Patel, "Split: Optimizing space, power, and throughput for TCAM-based classification," in Proc. ACM/IEEE ANCS'2011, pp. 200–210.

[21] O. Rottenstreich and I. Keslassy, "Worst-case TCAM rule expansion," in Proc. IEEE INFOCOM'2010, pp. 456–460.

[22] F. Yu, T. V. Lakshman, M. A. Motoyama, and R. H. Katz, "Efficient multimatch packet classification for network security applications," IEEE J. Sel. Areas Commun., vol. 24, no. 10, pp. 1805–1816, 2006.

[23] T. Lakshman and D. Stiliadis, "High-speed policy-based packet forwarding using efficient multi-dimensional range matching," in Proc. ACM SIGCOMM'98, pp. 203–214.

[24] F. Baboescu and G. Varghese, "Scalable packet classification," in Proc. ACM SIGCOMM'01, pp. 199–210.

[25] P. Gupta and N. McKeown, "Classifying packets with hierarchical intelligent cuttings," IEEE Micro, vol. 20, no. 1, pp. 34–41, 2000.

[26] T. Y. C. Woo, "A modular approach to packet classification: Algorithms and results," in Proc. IEEE INFOCOM'00, vol. 3, pp. 1213–1222.

[27] S. Singh, F. Baboescu, G. Varghese, and J. Wang, "Packet classification using multidimensional cutting," in Proc. ACM SIGCOMM'2003, pp. 213–224.

[28] Y.-K. Chang and Y.-H. Wang, "CubeCuts: A novel cutting scheme for packet classification," in Proc. IEEE WAINA'2012, pp. 274–279.

[29] H. Lu and S. Sahni, "O(log W) multidimensional packet classification," IEEE/ACM Trans. Netw., vol. 15, no. 2, pp. 462–472, 2007.

[30] P. Gupta and N. McKeown, "Packet classification on multiple fields," in Proc. ACM SIGCOMM'99, pp. 147–160.

[31] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, "Fast and scalable layer four switching," in Proc. ACM SIGCOMM'98, pp. 191–202.

[32] V. Srinivasan, G. Varghese, and S. Suri, "Packet classification using tuple space search," in Proc. ACM SIGCOMM'99, pp. 135–146.

[33] P.-C. Wang, C.-L. Lee, C.-T. Chan, and H.-Y. Chang, "Performance improvement of two-dimensional packet classification by filter rephrasing," IEEE/ACM Trans. Netw., vol. 15, no. 4, pp. 906–917, 2007.

[34] W. Jiang and V. Prasanna, "Scalable packet classification on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 9, pp. 1668–1680, 2012.

[35] OpenFlow Switch Specification, Version 1.0.0, OpenFlow Foundation, 2009. [Online]. Available: http://www.openflowswitch.org//documents/openflow-spec-v1.0.0.pdf

[36] P. Warkhede, S. Suri, and G. Varghese, "Fast packet classification for two-dimensional conflict-free filters," in Proc. IEEE INFOCOM'2001, pp. 1434–1443.

[37] F. Pong and N.-F. Tzeng, "HaRP: Rapid packet classification via hashing round-down prefixes," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 7, pp. 1105–1119, 2011.

[38] Y.-K. Chang, "A 2-level TCAM architecture for ranges," IEEE Trans. Comput., vol. 55, no. 12, pp. 1614–1629, 2006.

[39] V. Srinivasan and G. Varghese, "Fast address lookups using controlled prefix expansion," ACM Trans. Computer Systems, vol. 17, no. 1, pp. 1–40, February 1999.

[40] M. Eben-Chaime, A. Mehrez, and G. Markovich, "Capacitated location-allocation problems on a line," Computers and Operations Research, vol. 29, no. 5, pp. 459–470, 2002.

[41] B. Lampson, V. Srinivasan, and G. Varghese, "IP lookups using multiway and multicolumn search," IEEE/ACM Trans. Netw., vol. 7, no. 4, pp. 323–334, June 1999.

[42] H. Song, "Design and evaluation of packet classification systems," Ph.D. dissertation, Dept. of Computer Science and Engineering, Washington University, 2006.

[43] D. E. Taylor and J. S. Turner, "ClassBench: A packet classification benchmark," in Proc. IEEE INFOCOM'05, pp. 2068–2079.

[44] Y.-K. Chang and C.-Y. Chien, "Layer partitioned search tree for packet classification," in Proc. IEEE AINA'2012, pp. 276–282.

[45] 576Mb: x18, x36 RLDRAM 3, Micron, 2013. [Online]. Available: http://www.micron.com/~/media/Documents/Products/Data%20Sheet/DRAM/576mb rldram3.pdf

[46] T. Benson, A. Anand, A. Akella, and M. Zhang, "Understanding data center traffic characteristics," SIGCOMM Comput. Commun. Rev., vol. 40, no. 1, pp. 92–99, 2010.

[47] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood, "Fast hash table lookup using extended Bloom filter: An aid to network processing," in Proc. ACM SIGCOMM'05.

[48] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, "Longest prefix matching using Bloom filters," in Proc. ACM SIGCOMM'03, pp. 201–212.

Pi-Chung Wang received the M.S. and Ph.D. degrees in Computer Science and Information Engineering from National Chiao Tung University in 1997 and 2001, respectively. From 2002 to 2006, he was with Telecommunication Laboratories of Chunghwa Telecom, working on network planning in broadband access networks and PSTN migration. During these four years, he also worked on IP lookup and classification algorithms. Since 2006, he has been an assistant professor of Computer Science and Engineering at National Chung Hsing University. He is currently an associate professor. Wang's research interests include IP lookup and classification algorithms, scheduling algorithms, congestion control and application-driven wireless sensor networks.