Design and Implementation of Dynamic Routing Tables

Author: Dung-Jiun Lin Advisor: Yeim-Kuan Chang

Department of Computer Science and Information Engineering

National Cheng Kung University, Tainan, Taiwan, R.O.C.

Abstract

In the last few years, various schemes for high-performance IP address lookup have been proposed. These schemes can be broadly classified into two categories: schemes that use precomputation to build static routing tables, and schemes that build dynamic routing tables. Precomputation usually simplifies the data structure of the routing table and thus improves lookup speed and reduces the memory requirement. Its disadvantage, however, is that when a single prefix is added or deleted, the entire data structure may need to be rebuilt. Rebuilding the routing table seriously degrades the update performance of a backbone router. Schemes that can build the routing table dynamically are therefore better suited to such routers.

In this thesis, we develop a data structure called the Most Specific Prefix Tree (MSPT) that is suitable for dynamic routing tables. MSPT is a balanced binary search tree constructed from the most specific prefixes, that is, the prefixes that do not cover any other prefix in the routing table. The remaining (non-most-specific) prefixes are allocated to the enclosure sets of the most specific prefix nodes in the MSPT. With MSPT, the search, insertion, and deletion operations can be performed in O(log N) time, where N is the number of prefixes in the routing table. We compare MSPT with other schemes that build dynamic routing tables, such as PBOB (prefix binary tree on binary tree) and MRT (multiway range tree), and with several precomputation-based schemes. MSPT performs better than PBOB and MRT, and its lookup speed and memory requirement are close to those of the precomputation-based schemes. Moreover, the proposed scheme scales well to IPv6 and to large routing tables.

Keywords: IP address lookup, dynamic routing table, precomputation


Table of Contents

Chapter 1 Introduction
  1.1 Motivation
  1.2 Overview of the Thesis

Chapter 2 Background
  2.1 The Challenges to IP Address Lookup
  2.2 IPv6 Addressing Architecture
    2.2.1 IPv6 Address Syntax
    2.2.2 Type of IPv6 Address
      2.2.2.1 Unicast Address
      2.2.2.2 Anycast Address
      2.2.2.3 Multicast Address

Chapter 3 Review of the Previous Works
  3.1 Trie base scheme
    3.1.1 1-bit (Binary) trie
    3.1.2 Patricia Trie
    3.1.3 Multibit Trie
    3.1.4 Level Compressed Trie
  3.2 End-Point Array Scheme
    3.2.1 IP Lookups Using Multiway and Multicolumn Search
    3.2.2 An Efficient IP Routing Lookup by Using Routing Interval
  3.3 Sets of Equal-Length Prefixes Scheme
    3.3.1 Scalable high-speed IP routing lookups
  3.4 Search Tree Base Scheme
    3.4.1 Enhanced Interval Tree for Dynamic IP Router-Tables
    3.4.2 An O(log n) Dynamic Router-Table Design
    3.4.3 Multiway Range Tree
  3.5 Hybrid Base Scheme
    3.5.1 Lulea compressed trie
    3.5.2 Huang’s Compact Algorithms
  3.6 Summary

Chapter 4 Proposed IP Lookup Scheme
  4.1 Preliminaries
  4.2 Most Specific Prefix Tree – MSPT
    4.2.1 Most Specific Prefix Tree
    4.2.2 Data Structure for MSPT
  4.3 Finding the Longest Prefix Match
  4.4 Update
    4.4.1 Inserting a Prefix
    4.4.2 Deleting a Prefix
    4.4.3 Rotations Problem
  4.5 Enhance Our IP Address Lookup Scheme
  4.6 Migrate to IPv6

Chapter 5 Performance
  5.1 Simulation Environment
  5.2 Simulation Result for IPv4
  5.3 Simulation Result for IPv6

Chapter 6 Conclusion

References

List of Tables

Table 3.1 The representation of the LC-trie in Figure 3.4(c).
Table 3.2 Summary for precomputation-based IP address lookup schemes.
Table 3.3 Summary for non precomputation-based IP address lookup schemes.
Table 4.1 A sample prefix set.
Table 4.2 Data structure analysis for MSPT.
Table 5.1 BGP routing tables.
Table 5.2 The statistics of memory requirement (in KB) for IPv4.
Table 5.3 Data structure analysis for non precomputation-based schemes.
Table 5.4 The statistics of search time (in microseconds) for IPv4.
Table 5.5 The statistics for 16-bit segmentation table.
Table 5.6 The statistics of update time (in microseconds) for IPv4.
Table 5.7 The statistics of memory requirement (in KB) for IPv6.
Table 5.8 Data structure analysis for IPv6 prefix databases.
Table 5.9 The statistics of search and update time (in microseconds) for IPv6.

List of Figures

Figure 2.1 An IP Address Lookup example.
Figure 2.2 Aggregatable global unicast address format.
Figure 2.3 The three-level structure of an aggregatable global unicast address.
Figure 2.4 The IPv6 multicast address format.
Figure 3.1 A 1-bit trie example.
Figure 3.2 A Patricia Trie example.
Figure 3.3 An example multibit trie of 3-bit stride.
Figure 3.4 (a) Binary trie. (b) The path-compressed version of (a). (c) The level-compressed version of (b).
Figure 3.5 (a) Representation of the prefixes and ranges. (b) End point array.
Figure 3.6 Routing interval example.
Figure 3.7 Graphical representation of binary search on prefix length.
Figure 3.8 (a) A possible PTST. (b) An example RST for range(20).
Figure 3.9 The bit vector for range(20).
Figure 3.10 (a) Base interval tree (BIT). (b)~(f) Prefix tree for P1~P5.
Figure 3.11 An example of multiway range tree.
Figure 3.12 Basic concept of Huang’s scheme.
Figure 4.1 The relationships for a set of prefixes.
Figure 4.2 A binary trie for a set of prefixes.
Figure 4.3 A most specific prefix tree (MSPT) example for Table 4.1.
Figure 4.4 The enclosure set of node b in Figure 4.3.
Figure 4.5 Algorithm to find longest prefix match.
Figure 4.6 Algorithm of enclosure(x).search(d, key(x), port).
Figure 4.7 Algorithm to insert a prefix.
Figure 4.8 Algorithm to delete a prefix.
Figure 4.9 Algorithm to maintain MSPT enclosure constraint following a delete.
Figure 4.10 (a) An unbalanced MSPT after inserting a new prefix P6. (b) A rebalanced MSPT after performing a rebalancing rotation.
Figure 4.11 Correct MSPT after adjusting.
Figure 4.12 LL and RR rotations.
Figure 4.13 LR and RL rotations.
Figure 4.14 An MSPT example for IPv6.
Figure 5.1 16-bit segmentation table.
Figure 5.2 Total memory requirement (in KB) for IPv4.
Figure 5.3 Search time (in microseconds) for IPv4.
Figure 5.4 Update time (in microseconds) for IPv4.
Figure 5.5 Total memory requirement (in KB) for IPv6.
Figure 5.6 The mean times of search and update for IPv6.

Chapter 1 Introduction

1.1 Motivation

Owing to the exponential growth of Internet traffic, backbone links of several gigabits per second, such as OC-192 (10 Gbps) and OC-768 (40 Gbps), are commonly deployed. To handle such traffic rates, backbone routers must be able to forward millions of packets per second at each port, and IP address lookup is a critical task in reaching that forwarding capability. Moreover, the number of Internet hosts is also increasing rapidly, and the scarcity of IPv4 addresses led to the adoption of a classless addressing scheme called Classless Inter-Domain Routing (CIDR) [6]. With CIDR, routers aggregate forwarding information by storing address prefixes, each of which represents a group of addresses reachable through the same interface. Each route entry (prefix) in the routing table can have an arbitrary length from 1 to 32 bits, instead of the 8, 16, or 24 bits of the classful addressing scheme. When a router receives a packet, it uses the destination address in the packet’s header to look up the routing database. More than one route entry in the routing table may match the destination address, so a naive lookup may have to compare the address with every route entry to determine which match is the longest. The longest of the matching entries is called the longest prefix match (LPM). IP address lookup thus becomes a longest prefix matching problem, which makes router design more difficult.

To design a good IP address lookup scheme, we should consider several key requirements: lookup speed, storage requirement, scalability, and update time. We discuss each of these requirements in turn.

• Lookup speed: To handle the increased traffic, the IP address lookup scheme must quickly decide where each incoming packet should be sent next. This is clearly important if lookup is not to become a bottleneck in the Internet.

• Storage requirement: Memory-efficient schemes can also achieve good search times because compact data structures can fit in fast but expensive static RAM (SRAM).

• Scalability: Owing to the fast growth of the Internet and the increasing need for addresses, prefix databases are expected to keep growing, and the address prefix length will increase significantly with the switch to IPv6. IPv6 has been gaining wider acceptance as the replacement for its predecessor, IPv4, and has seen early deployment in Europe, Asia, and North America [9]. Therefore, an IP address lookup scheme must be able to handle large routing tables and longer addresses.

• Update: Currently, the Internet has a peak of a few hundred BGP updates per second. Address lookup schemes with fast update times are therefore desirable to avoid routing instabilities. These updates should interfere little with normal address lookup operation.

In the last few years, various algorithms for high-performance IP address lookup have been proposed. The survey paper [17] classifies a large variety of routing lookup algorithms and compares their worst-case complexities of lookup, update, and memory references. One category of algorithms is based on the trie structure. On top of this primitive data structure, a set of prefix compression and transformation techniques is used either to make the whole data structure small enough to fit in a cache or to speed up the tree traversal. IP lookup in the BSD kernel uses the Patricia data structure [16], a variant of a compressed binary trie. This scheme takes O(W) memory accesses per lookup, where W is the address length. LC-tries [15] speed up longest prefix matching by reducing the height of the trie. Degermark et al. [2] proposed a three-level tree structure for the routing table. Their data structure, called the Lulea scheme, is essentially a three-level fixed-stride trie in which trie nodes are compressed using a bitmap. Based on a two-level variable-stride data structure, Huang’s compact algorithm [8] uses a compaction technique to build the entire forwarding information (Compressed-Next-Hop-Array and Code-Word-Array). This technique is similar to the one used in [2].

In [18], a binary search is conducted over a set of hash tables, where prefixes of the same length are stored in one hash table. With this scheme, the longest prefix match can be found in O(log W) expected time. Lampson et al. [11] proposed an IP address lookup mechanism in which the longest prefix match is found by a simple binary search on an ordered array that stores the end points of the ranges defined by the prefixes. This scheme finds the longest prefix match in O(log N) time, but insertion and deletion take O(N) time, where N is the number of prefixes in the routing table.

Sahni and Kim [10] developed a data structure called a collection of red-black trees (CRBT) that supports the three operations of a routing table (longest prefix match, prefix insertion, and prefix deletion) in O(log N) time each. In [12], Lu and Sahni developed a data structure called BOB (binary tree on binary tree) for dynamic routing tables. Based on BOB, the related structures PBOB (prefix BOB) and LMPBOB (longest matching prefix BOB) were proposed for highest-priority prefix matching and longest prefix matching, respectively. On practical routing tables, LMPBOB and PBOB perform longest prefix matching in O(W) and O(log N) time, respectively. Insertion and deletion take O(log N) time in both structures. Suri et al. [21] proposed a B-tree-based data structure called the multiway range tree. This scheme achieves the optimal lookup time of binary search and can also be updated in logarithmic time when a prefix is inserted or deleted.

Despite the intense research of recent years, we think a good IP address lookup scheme should strike a balance among lookup speed, memory requirement, update time, and scalability. Summarizing the schemes above, those in [2], [8], [11], [15], and [18] perform a lot of precomputation to speed up lookup and reduce the memory requirement. This precomputation may force the entire data structure to be rebuilt when a single prefix is added or deleted, which seriously degrades the update performance of a backbone router. These schemes are therefore usually not suitable for dynamic routing tables. On the other hand, schemes based on trie data structures, such as the binary trie, multibit trie, and Patricia trie [16], do not use precomputation; however, their lookup cost grows linearly with the address length, so they scale poorly to IPv6 and to large routing tables.

Fast update remains a weak point of today’s IP address lookup schemes. Although [10] and [21] overcome the update problem, their complex data structures increase the memory requirement and reduce lookup performance. In this thesis, we develop the Most Specific Prefix Tree (MSPT), a data structure suitable for representing dynamic routing tables. With MSPT, the search, insertion, and deletion operations can be finished in O(log N) time for a real routing table. We compare MSPT with schemes suitable for dynamic routing tables, namely PBOB [12] (prefix binary tree on binary tree) and MRT [21] (multiway range tree), and with several precomputation-based schemes. MSPT performs better than PBOB and MRT, and its lookup speed and memory requirement are close to those of the precomputation-based schemes. Moreover, the proposed scheme scales well to IPv6 and to large routing tables.

1.2 Overview of the Thesis

The rest of the thesis is organized as follows. Chapter 2 gives the background of the IP address lookup problem. We first explain why IP address lookup is difficult in today’s environment. Then, because we want to migrate our proposed scheme to the next-generation Internet protocol, IPv6, we introduce the IPv6 address format. Chapter 3 classifies the existing IP address lookup schemes into five categories and briefly reviews them in terms of lookup speed, memory requirement, scalability, and update overhead. Chapter 4 describes the basic data structure and the detailed operations (search, insert, and delete) of our proposed IP address lookup scheme. Chapter 5 presents performance comparisons using real routing tables. Finally, Chapter 6 gives concluding remarks.

Chapter 2 Background

2.1 The Challenges to IP Address Lookup

As the Internet has evolved and grown, it faces two serious scaling problems:

• Exhaustion of the IP address space: Although the 32-bit address space of IPv4 supports about 4 billion IP devices, the IPv4 addressing scheme is not optimal, as described in RFC 3194 [5].

• Routing information overload: As the number of networks on the Internet has increased, the size and growth rate of the routing tables in Internet routers have gone beyond the ability to manage them efficiently.

CIDR is a mechanism that slows the growth of routing tables and allows IP addresses to be allocated more efficiently than under the old class A, B, and C address scheme. CIDR allows address aggregation at several levels, and an address is divided into two portions, the network identifier and the host identifier. An address prefix is written in the format <route prefix / prefix length>, where the prefix length ranges from 1 to 32 bits for IPv4 and from 1 to 128 bits for IPv6. An IP address may match several prefixes in a routing table. The matching prefix with the longest length is the valid route and is called the longest prefix match (LPM). Figure 2.1 shows a simple IP address lookup example with five prefixes. The incoming packet’s destination address (140.116.82.25) matches three entries (entry 1: 140.0.0.0/8, entry 2: 140.116.0.0/16, and entry 3: 140.116.82.0/24). Since entry 3 (140.116.82.0/24) has the longest prefix length, the packet is forwarded to next hop R3. Determining the longest prefix match therefore involves not only comparing the bit pattern itself but also finding the appropriate length, which makes the IP address lookup operation more complex and difficult.
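The lookup in this example can be sketched as a linear scan over the forwarding table. The following Python fragment is only an illustration of the longest prefix match rule, not the scheme proposed in this thesis; the function name `longest_prefix_match` is hypothetical, and the table entries are the five prefixes of Figure 2.1.

```python
import ipaddress

# Forwarding table of Figure 2.1: (prefix, next hop).
FORWARDING_TABLE = [
    ("140.0.0.0/8", "R1"),
    ("140.116.0.0/16", "R2"),
    ("140.116.82.0/24", "R3"),
    ("140.116.246.0/24", "R4"),
    ("140.118.0.0/24", "R5"),
]


def longest_prefix_match(destination, table=FORWARDING_TABLE):
    """Return (prefix, next_hop) of the longest matching entry, or None.

    Naive O(N) scan: every entry is compared with the destination and the
    matching entry with the greatest prefix length wins.
    """
    addr = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the best match so far
    for prefix, next_hop in table:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]


if __name__ == "__main__":
    print(longest_prefix_match("140.116.82.25"))
```

The scan touches every entry on each lookup, which is exactly the cost that the data structures reviewed in Chapter 3 try to avoid.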

Forwarding table:

Entry Number   Prefix             Next-Hop
1              140.0.0.0/8        R1
2              140.116.0.0/16     R2
3              140.116.82.0/24    R3
4              140.116.246.0/24   R4
5              140.118.0.0/24     R5

Destination address: 140.116.82.25

Figure 2.1 An IP Address Lookup example.

2.2 IPv6 Addressing Architecture

2.2.1 IPv6 Address Syntax

IPv6 represents its 128-bit address as eight 16-bit hexadecimal fields separated by colons (:), which makes the address representation less cumbersome. Here is an example of a valid IPv6 address: 2001:0400:13F0:0000:0000:09C0:876A:130B. Some types of addresses contain long sequences of zeros. To further simplify the representation of IPv6 addresses, IPv6 uses the following conventions:

• Leading zeros in each address field are optional and can be omitted. For example, the following address can be written in a compressed format:

2001:0400:13F0:0000:0000:09C0:876A:130B (original form)

=> 2001:400:13F0:0:0:9C0:876A:130B (compressed form)

• A pair of colons (::) represents successive fields of zeros. However, the pair of colons is allowed only once in a valid IPv6 address.

2001:0400:13F0:0000:0000:09C0:876A:130B (original form)

=> 2001:400:13F0::9C0:876A:130B (compressed form)

The IPv6 prefix is the part of the address formed by the leftmost bits, which have a fixed value and represent the network identifier. An IPv6 prefix is written in the IPv6-prefix/prefix-length format, just as an IPv4 prefix is written in Classless Inter-Domain Routing (CIDR) notation.
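The two compression conventions can be checked with Python's standard `ipaddress` module, as in the short sketch below. The address is the example used above; the /48 prefix is chosen only for illustration and does not come from a real routing table. Note that Python prints hexadecimal digits in lower case.

```python
import ipaddress

# The example address of this section in its original (uncompressed) form.
addr = ipaddress.IPv6Address("2001:0400:13F0:0000:0000:09C0:876A:130B")

# Both conventions applied: leading zeros dropped, one run of zero fields as "::".
compressed = addr.compressed
# All eight fields written out with four hexadecimal digits each.
exploded = addr.exploded

# Prefix notation (IPv6-prefix/prefix-length); this /48 is an illustrative choice.
prefix = ipaddress.ip_network("2001:400:13F0::/48")
covered = addr in prefix

if __name__ == "__main__":
    print(compressed)
    print(exploded)
    print(prefix, covered)
```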

2.2.2 Type of IPv6 Address

There are three major types of IPv6 addresses: unicast, anycast, and multicast.

2.2.2.1 Unicast Address

A unicast address identifies a single interface. There are three types of unicast addresses: aggregatable global unicast addresses, site-local unicast addresses, and link-local unicast addresses. Site-local and link-local unicast addresses are used only within a LAN, so routers do not deal with these two kinds of addresses in IP address lookup. We therefore introduce only the format of the aggregatable global unicast address. Figure 2.2 shows the structure of an aggregatable global unicast address.

Field   001      TLA ID    Res      NLA ID    SLA ID    Interface ID
Size    3 bits   13 bits   8 bits   24 bits   16 bits   64 bits

Figure 2.2 Aggregatable global unicast address format.

The fields in the aggregatable global unicast address are:

• TLA ID – Top-Level Aggregation Identifier. The size of this field is 13 bits. The TLA ID identifies the highest level in the routing hierarchy.

• RES – Bits that are reserved for future use in expanding the size of either the TLA ID or the NLA ID.

• NLA ID – Next-Level Aggregation Identifier. The NLA ID is used to identify a specific customer site.

• SLA ID – Site-Level Aggregation Identifier. The SLA ID is used by an individual organization to identify subnets within its site.

• Interface ID – Uses the IEEE EUI-64 identifier to indicate the interface on a specific subnet.
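The field layout of Figure 2.2 can be illustrated by slicing a 128-bit address into its fields. The sketch below is a minimal example under the field widths given in the figure; the function name `split_global_unicast` and the label `FP` for the leading 001 bits are assumptions made for this illustration.

```python
import ipaddress

# (field name, width in bits), most significant field first, as in Figure 2.2.
# "FP" is a label chosen here for the leading 001 bits.
FIELDS = [
    ("FP", 3),
    ("TLA_ID", 13),
    ("RES", 8),
    ("NLA_ID", 24),
    ("SLA_ID", 16),
    ("INTERFACE_ID", 64),
]


def split_global_unicast(address):
    """Split a 128-bit IPv6 address into the fields of Figure 2.2."""
    value = int(ipaddress.IPv6Address(address))
    remaining = 128  # number of bits not yet consumed, counted from the left
    fields = {}
    for name, width in FIELDS:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    parts = split_global_unicast("2001:0400:13F0:0000:0000:09C0:876A:130B")
    for name, field_value in parts.items():
        print(name, hex(field_value))
```

The first four fields together occupy 48 bits, the SLA ID 16 bits, and the Interface ID 64 bits, which matches the widths listed in the figure.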

The fields of the aggregatable global unicast address form the three-level structure shown in Figure 2.3. The public topology is the collection of larger and smaller ISPs that provide access to the IPv6 Internet. The site topology is the collection of subnets within an organization’s site.

Fields                     Size      Level
001, TLA ID, Res, NLA ID   48 bits   Public Topology
SLA ID                     16 bits   Site Topology
Interface ID               64 bits   Interface Identifier

Figure 2.3 The three-level structure of an aggregatable global unicast address.

2.2.2.2 Anycast Address

An anycast address is a global unicast address that is assigned to a set of interfaces that typically belong to different nodes. Hence an anycast address identifies multiple interfaces. A packet sent to an anycast address is delivered to the closest interface. An anycast address is syntactically indistinguishable from a global unicast address because anycast addresses are allocated from the global unicast address space.

2.2.2.3 Multicast Address

In IPv6, multicast traffic operates in the same way as in IPv4. An IPv6 multicast address is an IPv6 address with the prefix FF00::/8. It is easy to recognize as multicast because it always begins with “FF”. Figure 2.4 gives the IPv6 multicast address format.

Field   11111111   Flags    Scope    Group ID
Size    8 bits     4 bits   4 bits   112 bits

Figure 2.4 The IPv6 multicast address format.

The fields in the multicast address are:

• Flags – Only the low-order bit of the flags field is used. When it is set to 0, the multicast address is a permanently assigned (well-known) multicast address. When it is set to 1, the multicast address is a transient (not permanently assigned) multicast address.

• Scope – Indicates the scope of the IPv6 internetwork for which the multicast traffic is intended.

• Group ID – Identifies the multicast group and is unique within the scope.
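The multicast fields described above can be extracted with a few shifts and masks. The following Python sketch assumes the field widths of Figure 2.4; the function name `parse_multicast` and the two sample addresses are chosen only for illustration.

```python
import ipaddress


def parse_multicast(address):
    """Split an IPv6 multicast address into flags, scope, and group ID."""
    value = int(ipaddress.IPv6Address(address))
    if value >> 120 != 0xFF:  # the top 8 bits must be 11111111
        raise ValueError("not an IPv6 multicast address: %s" % address)
    flags = (value >> 116) & 0xF
    scope = (value >> 112) & 0xF
    group_id = value & ((1 << 112) - 1)
    return {
        "flags": flags,
        "scope": scope,
        "group_id": group_id,
        # Low-order flag bit: 0 = permanently assigned, 1 = transient.
        "transient": bool(flags & 0x1),
    }


if __name__ == "__main__":
    print(parse_multicast("FF02::1"))
    print(parse_multicast("FF15::1234"))
```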

Chapter 3 Review of the Previous Works

Many algorithms for longest prefix matching have been proposed in recent years. In this chapter, we survey IP address lookup algorithms and compare their performance in terms of lookup speed, memory requirement, scalability, and update overhead.

3.1 Trie base scheme

3.1.1 1-bit (Binary) trie

A trie is a tree-based data structure that organizes prefixes digitally, using the bits of the prefixes to direct the branching. Figure 3.1 shows an example of a 1-bit trie, also referred to as a binary trie. The 1-bit trie is a basic and simple data structure used in IP lookup algorithms in which each node contains two pointers, a 0-pointer (to the left child) and a 1-pointer (to the right child). It works like a binary search tree in which each bit value (0 or 1) of the address guides the search to the left or the right part of the tree.

A binary trie may contain long sequences of one-child nodes. These one-way branch nodes consume additional memory. Moreover, since these nodes must be inspected, the search time can be longer than necessary in some cases. In a binary trie we potentially traverse a number of nodes equal to the address length, so the search complexity is O(W), where W is the address length. An update basically requires a search, so the update complexity is also O(W). The memory consumption for a set of N prefixes is O(NW).
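The insert and lookup operations of a 1-bit trie can be sketched as follows. This is a minimal illustration, not code from the thesis: prefixes and addresses are given as strings of '0' and '1', the class and method names are hypothetical, and the example uses the prefix set shown with Figure 3.1.

```python
class TrieNode:
    """One node of a 1-bit trie: a 0-pointer, a 1-pointer, and an optional prefix."""

    def __init__(self):
        self.children = [None, None]  # children[0] = 0-pointer, children[1] = 1-pointer
        self.prefix = None            # name of the prefix ending at this node, if any


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, bits, name):
        """Insert a prefix given as a bit string; takes O(W) steps."""
        node = self.root
        for bit in bits:
            index = int(bit)
            if node.children[index] is None:
                node.children[index] = TrieNode()
            node = node.children[index]
        node.prefix = name

    def lookup(self, address_bits):
        """Return the name of the longest matching prefix, or None."""
        node = self.root
        best = node.prefix  # the empty prefix (*) matches every address
        for bit in address_bits:
            node = node.children[int(bit)]
            if node is None:
                break
            if node.prefix is not None:
                best = node.prefix  # remember the longest match seen so far
        return best


def build_example_trie():
    """Build the trie for the prefix set of Figure 3.1."""
    trie = BinaryTrie()
    for bits, name in [("", "P1"), ("0101", "P2"), ("100", "P3"),
                       ("1001", "P4"), ("10111", "P5")]:
        trie.insert(bits, name)
    return trie


if __name__ == "__main__":
    print(build_example_trie().lookup("10010"))
```

The lookup remembers the last prefix seen on the path, so no backtracking is needed when the walk falls off the trie.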

12

Figure 3.1 A 1-bit trie example.

3.1.2 Patricia Trie

A binary trie consumes a lot of space to store prefixes. To reduce both space and time complexity, a technique called path compression can be used, which collapses the one-way branch nodes. The Patricia trie, also called the BSD radix trie [16], was the first trie variant to use this technique. A Patricia trie is a full binary tree (every internal node has two children) with exactly N external nodes and N-1 internal nodes, so its space complexity is O(N), which is better than the O(NW) of the binary trie.

For example, Figure 3.2 shows the Patricia trie for four entries a, b, c and d. To find the LPM for 01001, we see that bit 0 is off and then compare the address with prefix a, which matches and is the correct result. To find the LPM for 10111, we see that bit 0 is on, then bit 2 is on, then bit 4 is on, and finally we compare the address with prefix d. It does not match, so we must backtrack recursively until we find that entry b is the correct answer.

As this example shows, if no backtracking is needed, the lookup complexity is O(W). When backtracking occurs, the lookup complexity degrades to O(W^2). Backtracking is therefore the main drawback of the Patricia trie.

Prefix set for Figure 3.1: P1 = *, P2 = 0101*, P3 = 100*, P4 = 1001*, P5 = 10111.

Figure 3.2 A Patricia Trie example.

3.1.3 Multibit Trie

A binary trie needs many memory accesses to find the LPM. To reduce the number of memory accesses, we can use a multibit trie, which inspects several bits at a time. The depth of the subtrees that are combined to form a single multibit trie node is called the stride. If all nodes at the same level of a multibit trie have the same stride, the trie is a fixed-stride trie; otherwise, it is a variable-stride trie. Figure 3.3 shows a multibit trie with a 3-bit stride. Converting a binary trie into a multibit trie with an m-bit stride reduces the number of memory accesses from W to W/m.

Prefix set for Figure 3.2: a = 0*, b = 10*, c = 10110, d = 10101*. The internal nodes of the Patricia trie test bit 0, bit 2, and bit 4.

Figure 3.3 An example multibit trie of 3-bit stride.

3.1.4 Level Compressed Trie

Path compression removes unnecessary paths and thus decreases the space complexity. The LC-trie [15] extends this idea and reduces the depth of the tree further. For example, the binary trie of six prefixes is shown in Figure 3.4(a), its path-compressed version in Figure 3.4(b), and the level-compressed version of the path-compressed trie in Figure 3.4(c). Constructing an LC-trie for N prefixes takes O(N log N) time. Each node (including each internal node) is stored in 32 bits divided into three fields. The first 5 bits hold the branching factor. The number of children is always a power of 2, and this field stores the exponent, so the maximum branching factor is 2^31. The next 7 bits hold the skip value, which can therefore range from 0 to 127. The remaining 20 bits hold the pointer to the leftmost child. Table 3.1 shows the representation of the LC-trie in Figure 3.4(c). For example, to find the LPM for 1110110, we start at the root, node number 1. Its branching factor is 2 and its skip value is 0, so we extract the first two bits of the string. These two bits have the value 3, which is added to the pointer and leads to position 5 in Table 3.1. At this node, the branching factor is 1 and the skip value is 3, so we skip three bits and extract the sixth bit. Its value is 1, and adding 1 to the pointer takes us to position 9. At this node, the branching factor is 0, which marks a leaf, and the pointer refers to entry f, which is the correct answer. In short, the LC-trie compresses the levels of the tree, which means that less space is needed to store the forwarding table.
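The 5/7/20-bit node layout and one step of the walk described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the pointer values 2 and 8 used in the usage lines are inferred from the walk in the text (positions 5 and 9), since the contents of Table 3.1 are not reproduced here.

```python
BRANCH_BITS, SKIP_BITS, POINTER_BITS = 5, 7, 20   # 32 bits in total


def pack_node(branch, skip, pointer):
    """Pack the three fields of an LC-trie node into one 32-bit integer."""
    assert 0 <= branch < (1 << BRANCH_BITS)
    assert 0 <= skip < (1 << SKIP_BITS)
    assert 0 <= pointer < (1 << POINTER_BITS)
    return (branch << (SKIP_BITS + POINTER_BITS)) | (skip << POINTER_BITS) | pointer


def unpack_node(word):
    """Return (branch, skip, pointer) from a packed node."""
    pointer = word & ((1 << POINTER_BITS) - 1)
    skip = (word >> POINTER_BITS) & ((1 << SKIP_BITS) - 1)
    branch = word >> (SKIP_BITS + POINTER_BITS)
    return branch, skip, pointer


def child_index(word, addr_bits, pos):
    """One step of the walk: skip `skip` bits, read `branch` bits, add them to the pointer.

    Returns (position of the child in the node array, next bit position)."""
    branch, skip, pointer = unpack_node(word)
    pos += skip
    value = int(addr_bits[pos:pos + branch], 2) if branch else 0
    return pointer + value, pos + branch


# Assumed field values, inferred from the walk for 1110110 described in the text.
root = pack_node(branch=2, skip=0, pointer=2)
node5 = pack_node(branch=1, skip=3, pointer=8)
print(child_index(root, "1110110", 0))    # (5, 2)
print(child_index(node5, "1110110", 2))   # (9, 6)
```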

Figure 3.4 (a) Binary trie. (b) The path-compressed version of (a). (c) The level-compressed version of (b).

Table 3.1 The representation of the LC-trie in Figure 3.4(c).

3.2 End-Point Array Scheme

3.2.1 IP Lookups Using Multiway and Multicolumn Search

Lampson, Srinivasan, and Varghese [11] proposed a data structure in which the end points of the ranges defined by the prefixes are stored in ascending order in an array. The longest prefix match is found by a simple binary search on this sorted array. They also propose an initial array indexed by the first X bits of the address, and they exploit the cache line size to perform a multiway search with six-way branching. With this scheme, the longest prefix match can be determined in O(log N) time. However, updating the end-point array after the insertion or deletion of a prefix takes O(N) time, where N is the number of prefixes in the router table. Figure 3.5(a) shows a set of five prefixes together with their start and finish points, and Figure 3.5(b) shows the distinct range end points stored in ascending order.

Figure 3.5 (a) Representation of the prefixes and ranges. (b) End point array.

(a) Prefix set of 5-bit addresses

Prefix       Range Start   Range Finish
P1 *         0             31
P2 01*       8             15
P3 1*        16            31
P4 10*       16            23
P5 0011*     6             7

(b) End point array

End Point    >     =
0            P1    P1
6            P5    P5
7            P1    P5
8            P2    P2
15           P1    P2
16           P4    P4
23           P3    P4
31           -     P3
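A lookup on the end-point array of Figure 3.5(b) can be sketched in Python with the standard bisect module. The sketch assumes the usual reading of the two precomputed columns: "=" holds the LPM for an address equal to the end point, and ">" holds the LPM for addresses strictly between this end point and the next one. The names END_POINTS and lookup are illustrative.

```python
from bisect import bisect_right

# Rows of Figure 3.5(b): (end point, ">" entry, "=" entry).
END_POINTS = [
    (0, "P1", "P1"), (6, "P5", "P5"), (7, "P1", "P5"), (8, "P2", "P2"),
    (15, "P1", "P2"), (16, "P4", "P4"), (23, "P3", "P4"), (31, None, "P3"),
]
KEYS = [row[0] for row in END_POINTS]


def lookup(addr):
    """Binary search for the largest end point <= addr, then pick the column."""
    i = bisect_right(KEYS, addr) - 1
    point, greater, equal = END_POINTS[i]
    return equal if addr == point else greater


print(lookup(20))  # P4, because 20 lies strictly between the end points 16 and 23
```

The search itself is O(log N), but both columns are precomputed, which is why inserting or deleting a prefix can force O(N) entries to be rewritten.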

3.2.2 An Efficient IP Routing Lookup by Using Routing Interval

Observing that the number of possible next hops for a segment is always much smaller than the total number of ports, Wang et al. [19] propose a routing concept named "Routing Interval" that turns the longest prefix match problem into a much simpler search problem. By sorting the routing prefixes by length, the scheme builds a next-hop array in which each element maps to an IP address interval and is filled with the corresponding next hop. Figure 3.6 shows an example with three prefixes, 140.116.0.0/255.255.0.0/NH1, 140.116.3.0/255.255.255.0/NH2, and 140.116.215.0/255.255.255.0/NH3. Moreover, to achieve higher performance, this scheme exploits the characteristics of the memory bus between the L1 and L2 caches to speed up the search, unlike the cache line alignment used in the previous scheme [11]. The search, memory, and update complexities of this scheme are the same as those in [11].

Figure 3.6 Routing interval example.

Routing prefixes:

140.116.0.0/255.255.0.0/NH1
140.116.3.0/255.255.255.0/NH2
140.116.215.0/255.255.255.0/NH3

Routing intervals and next-hop array:

140.116.0.0~140.116.2.255        NH1
140.116.3.0~140.116.3.255        NH2
140.116.4.0~140.116.214.255      NH1
140.116.215.0~140.116.215.255    NH3
140.116.216.0~140.116.255.255    NH1

3.3 Sets of Equal-Length Prefixes Scheme

3.3.1 Scalable high-speed IP routing lookups

Waldvogel et al. [18] proposed a data structure that determines the longest prefix match by performing a binary search on prefix lengths. The routing table is divided by prefix length into up to 32 sets for IPv4 and 128 sets for IPv6, and each set is stored in a hash table. The lookup performs a binary search over the prefix lengths and probes the hash table of each length it visits. For the search to work correctly, however, markers must be created for the prefixes.

For example, suppose we have three prefixes a: 0*, b: 10*, c: 110*, and we want to find the LPM for 1101. Figure 3.7 shows this example graphically. The binary search first probes the table for length 2. The first two bits, 11, do not match b, so without further information the search would conclude that no longer prefix can match and would turn to the shorter lengths, although prefix c is the correct answer. To prevent this, a marker 11 must be added to the length-2 table alongside b, which directs the search to the longer lengths. The need for markers is one problem of this method. A second problem is that the binary search may end on a marker rather than on a real prefix. A straightforward solution is to backtrack recursively to find the correct answer, but this costs a lot of time. The authors instead use precomputation: the correct answer for each marker is computed in advance and stored with it. As a result, the scheme needs a large amount of space to store all the markers and is difficult to update quickly.
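The role of markers and of the precomputed answer can be illustrated with a small Python sketch for the three-prefix example (a: 0*, b: 10*, c: 110*). Python dictionaries stand in for the hash tables, each table entry stores the precomputed best matching prefix (None for a marker that no real prefix matches), and the default route is not handled. All names are illustrative and are not taken from [18].

```python
def build_tables(prefixes, lengths, with_markers=True):
    """prefixes maps a bit string such as '110' to a prefix name.

    Returns one table per prefix length; each entry stores the precomputed
    best matching prefix for that bit string."""
    def best_real_match(bits):
        # Longest real prefix matching the marker string (the precomputed answer).
        for n in range(len(bits), 0, -1):
            if bits[:n] in prefixes:
                return prefixes[bits[:n]]
        return None

    tables = {length: {} for length in lengths}
    for bits, name in prefixes.items():
        tables[len(bits)][bits] = name            # a real prefix answers with itself
    if with_markers:
        for bits in prefixes:
            lo, hi = 0, len(lengths) - 1
            while lo <= hi:                       # replay the binary search for this prefix
                mid = (lo + hi) // 2
                if lengths[mid] < len(bits):      # the search must be steered to longer lengths
                    marker = bits[:lengths[mid]]
                    tables[lengths[mid]].setdefault(marker, best_real_match(marker))
                    lo = mid + 1
                elif lengths[mid] > len(bits):
                    hi = mid - 1
                else:
                    break
    return tables


def search(tables, lengths, addr_bits):
    """Binary search on prefix length; a hit moves to longer lengths, a miss to shorter ones."""
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        table, key = tables[lengths[mid]], addr_bits[:lengths[mid]]
        if key in table:
            if table[key] is not None:
                best = table[key]
            lo = mid + 1
        else:
            hi = mid - 1
    return best


PREFIXES = {"0": "a", "10": "b", "110": "c"}
LENGTHS = [1, 2, 3]
with_markers = build_tables(PREFIXES, LENGTHS)
without_markers = build_tables(PREFIXES, LENGTHS, with_markers=False)
print(search(with_markers, LENGTHS, "1101"))     # c
print(search(without_markers, LENGTHS, "1101"))  # None: the search never reaches length 3
```

For the address 1110, the search hits the marker 11 and then misses at length 3. Because the marker carries its precomputed answer (here, no matching prefix), the search can stop without backtracking.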

Figure 3.7 Graphical representation of binary search on prefix length.

Prefix Length    Hash Table entries
1                a = 0*
2                b = 10*, 11 (marker)
3                c = 110*

3.4 Search-Tree-Based Schemes

3.4.1 Enhanced Interval Tree for Dynamic IP Router-Tables

Lu and Sahni [12] proposed a data structure called BOB (binary tree on binary tree) for dynamic router tables. The first level is a red-black tree called the point search tree (PTST). Each node z of the PTST stores a point, point(z), and a subset of the ranges, range(z). The points in the left subtree of node z are smaller than point(z), and those in its right subtree are larger than point(z). Let R be the set of ranges stored in the PTST. All ranges r ∈ R such that start(r) ≤ point(z) ≤ finish(r) are stored in range(z) of node z. All r ∈ R such that finish(r) < point(z) are stored in the left subtree of z, and the remaining ranges of R are stored in the right subtree of z. For every node z of the PTST, range(z) is represented as a balanced binary search tree called the range search tree (RST), which is the second-level tree. Figure 3.8(a) shows a possible PTST for a range set, and Figure 3.8(b) shows the RST for range(20).

Figure 3.8 (a) A possible PTST. (b) An example RST for range(20).

Prefix set of 5-bit addresses:

Prefix       Range       Length
P1 *         [0, 31]     0
P2 001*      [4, 7]      3
P3 1*        [16, 31]    1
P4 10*       [16, 23]    2
P5 1000*     [16, 17]    4
P6 100*      [16, 19]    3
P7 101*      [20, 23]    3

(a) PTST nodes, listed as point(z): range(z), with each range written as ([start, finish], length):

12: ([0, 31], 0)
6: ([4, 7], 3)
20: ([16, 31], 1), ([16, 23], 2), ([20, 23], 3)
16: ([16, 17], 4), ([16, 19], 3)

(b) RST for range(20): root P4 ([16, 23], 2) with children P3 ([16, 31], 1) and P7 ([20, 23], 3).

To further reduce memory usage, the RST stored in each node of the PTST can be replaced by a W-bit vector. Let bit(z)[i] denote the ith bit of the bit vector stored in node z of the PTST; bit(z)[i] = 1 iff range(z) contains a prefix whose length is i. Figure 3.9 shows the bit vector for range(20) in the PTST. With this scheme, it takes O(W) time to determine the LPM and O(log N) time to handle an insertion or deletion.
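The bit vector of a single PTST node can be sketched in Python as follows. The sketch covers only the per-node step and relies on one observation: every prefix in range(z) contains point(z), so a prefix of length i in range(z) matches an address d exactly when d and point(z) agree in their first i bits. The function names are illustrative, and the vector is indexed 0 to W-1 as in the figure.

```python
W = 5  # address length used in the Figure 3.8 example


def bit_vector(prefix_lengths, width=W):
    """bits[i] = 1 iff the node's range set has a prefix of length i."""
    bits = [0] * width
    for length in prefix_lengths:
        bits[length] = 1
    return bits


def longest_match_in_node(bits, point, d, width=W):
    """Length of the longest prefix of this node that matches d, or None."""
    common = 0                              # number of leading bits shared by point and d
    for i in range(width):
        shift = width - 1 - i
        if (point >> shift) & 1 != (d >> shift) & 1:
            break
        common += 1
    for length in range(min(common, len(bits) - 1), -1, -1):
        if bits[length]:
            return length
    return None


range_20 = bit_vector([1, 2, 3])            # lengths of P3, P4, and P7
print(range_20)                             # [0, 1, 1, 1, 0]
print(longest_match_in_node(range_20, 20, 0b10101))  # 3, i.e. P7 = 101*
```

Scanning the vector takes O(W) time per node, in line with the bound quoted above.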

Figure 3.9 The bit vector for range(20).

3.4.2 An O(log n) Dynamic Router-Table Design

Sahni and Kim [10] propose a collection of red-black trees (CRBT) to determine the longest prefix match. The CRBT comprises a front-end data structure called the basic interval tree (BIT) and a back-end data structure called a collection of prefix trees (CPT). For each prefix and basic interval x, next(x) is defined as the smallest-range prefix whose range includes the range of x. The prefix tree for a prefix p comprises a header node plus one internal node for every prefix or basic interval x such that next(x) = p. Each external node of the BIT represents a basic interval x and points to the nonheader node that represents x in the prefix tree for next(x). The basic interval tree and the prefix trees for a set of five prefixes are shown in Figure 3.10(a)-(f).

Figure 3.9 data, the bit vector for range(20) (bit number = prefix length):

Bit number    0    1     2     3     4
Bit value     0    1     1     1     0
Prefix        -    P3    P4    P7    -

Figure 3.10 (a) Basic interval tree (BIT). (b)~(f) Prefix trees for P1~P5.

Prefix set of 5-bit addresses for Figure 3.10:

Prefix       Range Start   Range Finish
P1 *         0             31
P2 001*      3             7
P3 100*      16            19
P4 1001*     18            19
P5 10111     23            23

The distinct end points 0, 3, 7, 16, 18, 19, 23, and 31 delimit the basic intervals r1 to r7.

The search for the LPM begins with a search of the BIT for the basic interval that matches the destination address of the incoming packet. It then determines whether the destination address equals the left (right) end point of the matching basic interval. If it does not, the search follows the basic interval pointer stored in the external node of the BIT to reach the header node of the prefix tree that corresponds to the LPM. With a CRBT, finding the longest matching prefix as well as inserting and deleting a prefix takes O(log N) time, where N is the number of prefixes in the router table.

3.4.3 Multiway Range Tree

Since ordinary binary search on the end-point array relies on precomputation [11], Suri et al. [21] proposed a data structure called the multiway range tree for dynamic router tables. It achieves the optimal lookup time of binary search but can also be updated quickly when a prefix is added or deleted. The main idea behind this scheme is that each prefix maps to a range in the address domain. A set of n prefixes partitions the address line into at most 2n intervals. The scheme builds a tree whose leaves correspond to the end points of these intervals. Figure 3.11 shows an example. The longest matching prefix is found by searching the tree for the matching interval. Moreover, increasing the arity of the tree to take advantage of the cache line size reduces the height of the tree and improves the search time. With this scheme, finding the longest prefix match and inserting or deleting a prefix each take O(log N) time, where N is the number of prefixes in the routing table.

Figure 3.11 An example of multiway range tree.

Prefix    Start Point   Finish Point
P1        a             o
P2        b             g
P3        h             j
P4        k             n
P5        c             d
P6        e             f
P7        i             j
P8        k             l
P9        m             n

3.5 Hybrid-Based Schemes

3.5.1 Lulea Compressed Trie

Degermark et al. [2] proposed a three-level trie in which the strides of the levels are 16, 8, and 8. To reduce the memory requirement, they encode the nodes of this trie using a bit vector and three small arrays that store the information needed to calculate the index into the first-level array: a 1-D base array, a 1-D codeword array, and a 2-D maptable array.

Levels two and three of the data structure consist of chunks. A chunk covers a subtree of height 8 and can contain at most 2^8 = 256 heads. There are three kinds of chunk, depending on how many heads the imaginary bit vector contains: a chunk with 1-8 heads is sparse, a chunk with 9-64 heads is dense, and a chunk with 65-256 heads is very dense. Dense and very dense chunks are searched analogously to the first level. For sparse chunks, a linear scan is used to obtain the routing information.

3.5.2 Huang’s Compact Algorithms

Nen-Fu Huang proposed an address lookup scheme with a two-level memory organization based on the variable-stride trie concept [8]. The first 16 bits of the destination IP address are used as the index into the first-level array. Each element of the first-level array carries a length field that indicates the size of the next-level array, called the next hop array (NHA). The NHA is further compressed by using a code word array (CWA) and a compressed NHA (CNHA). Figure 3.12 shows the basic concept of Huang’s scheme. With this compression technique, the forwarding table is small enough to fit into faster SRAM, and the lookup can be implemented as a hardware pipeline to improve the speed of address lookup.

Figure 3.12 Basic concept of Huang’s scheme.

3.6 Summary

Current IP lookup schemes can be broadly classified into two categories: schemes that use precomputation to build static routing tables, and schemes that build dynamic routing tables. Schemes such as "LC-trie" [15], "multiway and multicolumn search" [11], "routing interval search" [19], "binary search on prefix length" [18], "Lulea compressed trie" [2] and "Huang’s compact algorithm" [8] perform a lot of precomputation. Precomputation usually simplifies the data structure built by the lookup algorithm and thus improves the lookup speed and the memory requirement. The downside is that when a single prefix is added or deleted, the entire data structure may need to be rebuilt, which seriously affects the update performance of a backbone router. Precomputation-based schemes are therefore usually not suitable for dynamic routing tables. Table 3.2 summarizes the characteristics of these precomputation-based schemes.

On the other hand, schemes based on the trie data structure, such as the "binary trie", the "Patricia trie" [16], and the "multibit trie", do not use precomputation. However, their lookup cost grows linearly with the address length, so these schemes do not scale well to IPv6 or to large routing tables. CRBT (collection of red-black trees [10]) and MRT (multiway range tree [21]) are schemes designed for dynamic routing tables, but their complex data structures may inflate the memory requirement or reduce the lookup speed. The characteristics of these schemes that are not based on precomputation are summarized in Table 3.3.

Table 3.2 Summary for precomputation-based IP address lookup schemes.

Schemes / Comments and characteristics

LC-trie [15]: This scheme extends path compression and reduces the depth of the tree further. The authors use an array to represent the tree; each entry of the array represents a node of the LC-trie and contains routing information. An update requires rebuilding the entire data structure. Since the scheme is trie-based, switching to IPv6 may inflate the memory requirement and degrade the search and update performance.

Multiway and multicolumn search [11]: The lookup speed (O(log N)), the memory requirement (O(N)), and the scalability depend on the number of distinct end points stored in the sorted array. The routing information associated with the end points must be precomputed, so the whole data structure has to be rebuilt when a prefix is inserted or deleted.

Routing interval search [19]: The lookup speed, update cost, memory requirement, and scalability are similar to those in [11]. The authors state that rebuilding the entire data structure for an update takes less time than in [11].

Binary search on prefix length [18]: Choosing a good hash function is critical. With a perfect hash, the LPM can be found in O(log W) expected time, but the scheme needs a lot of memory. For an update, the precomputed LPMs of the markers need to be reconstructed.

Lulea compressed trie [2] and Huang’s compact algorithm [8]: Both schemes use bit vectors to represent the nodes stored in the trie. For IPv4, both achieve a good memory requirement and lookup speed. When switching to IPv6, the first level (the 16-bit segmentation table) is no longer suitable; the longer address length may cause the stride framework to need a lot of memory and may reduce the lookup speed. Because of the precomputation, the whole data structure must also be rebuilt for an update.

Table 3.3 Summary for non-precomputation-based IP address lookup schemes.

Schemes / Comments and characteristics

Binary trie: The search time and update performance depend on the address length (O(W)). The long sequences of one-child nodes consume additional memory (O(NW)). All performance measures become markedly worse when switching to IPv6.

Multibit trie: The choice of strides affects the search time, update cost, and memory requirement. When switching to IPv6, the memory requirement increases sharply.

Patricia trie [16]: Path compression makes a lot of sense when the binary trie is sparse, but as the number of prefixes increases and the trie gets denser, it has little benefit.

BOB [12], CRBT [10], and multiway range tree [21]: These three schemes can handle dynamic routing tables. With them, the three operations of IP lookup (search, insertion, and deletion) can be performed in O(log N) time on a real routing table. Because of their complex data structures, [10] and [21] need more memory to store the routing information and take more time to finish an update. In contrast, the simpler data structure of [12] gives better memory and update performance. As for lookup speed, the tree height in [21] is the lowest, so it may have the best performance among the three schemes. Comparing the three schemes in terms of lookup speed, memory requirement, update, and scalability, we think that BOB [12] may be a good choice for finding the LPM.

Chapter 4 Proposed IP Lookup Scheme

In this chapter, we first introduce the notation and terminology used throughout the chapter. We then present our proposed IP address lookup scheme (MSPT) and describe its search, insertion, and deletion operations in detail. Finally, we extend our scheme to the next-generation Internet protocol, IPv6.

4.1 Preliminaries

Definition 1 (prefix representation): A prefix P is a range r of addresses from b to e. It can be represented as P = r = [b, e], b ≤ e, where b is the start address of prefix P and e is its finish address. The prefix P matches the contiguous interval [b, e] of addresses.

Definition 2 (relation): Let A = [b, e] and B = [u, v] be the ranges of two prefixes.

(a) Disjoint: A and B are disjoint iff they have no address in common, i.e., A ∩ B = Ø.

(b) Enclosure: A and B are enclosure iff the address space covered by one range is a subset of that covered by the other, i.e., B ⊇ A or A ⊇ B.

(c) Intersecting: A and B are intersecting iff they have a nonempty intersection but neither encloses the other, i.e., b < u ≤ e < v or u < b ≤ v < e.

(d) The relation A < B or A > B is defined only when A and B are disjoint: A < B iff e < u (the finish address of A is smaller than the start address of B), and A > B iff b > v (the start address of A is larger than the finish address of B).

For example, assume the address length is five and let P1 = 1111* = [30, 31], P2 = 0101* = [10, 11], P3 = 100* = [16, 19], and P4 = 1001* = [18, 19]. Among these four prefixes, P1, P2, and P3/P4 are mutually disjoint with P1 > P3/P4 > P2. Moreover, P3 and P4 are enclosure.
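Definitions 1 and 2 can be checked mechanically with a short Python sketch. It converts a prefix written as a bit string into its range [b, e] for 5-bit addresses and classifies the relation between two ranges. The function names are illustrative and do not appear in the original text.

```python
W = 5  # address length of the example


def prefix_range(bits, width=W):
    """Definition 1: '100' -> (16, 19); the empty string is the default prefix *."""
    pad = width - len(bits)
    start = (int(bits, 2) << pad) if bits else 0
    return start, start + (1 << pad) - 1


def relation(A, B):
    """Definition 2(a)-(c) for two ranges A = (b, e) and B = (u, v)."""
    (b, e), (u, v) = A, B
    if e < u or v < b:
        return "disjoint"
    if (u <= b and e <= v) or (b <= u and v <= e):
        return "enclosure"
    return "intersecting"


def less_than(A, B):
    """Definition 2(d): A < B iff the finish address of A is below the start address of B."""
    return A[1] < B[0]


P1, P2 = prefix_range("1111"), prefix_range("0101")
P3, P4 = prefix_range("100"), prefix_range("1001")
print(P1, P2, P3, P4)      # (30, 31) (10, 11) (16, 19) (18, 19)
print(relation(P3, P4))    # enclosure
print(relation(P1, P2))    # disjoint
```

Two arbitrary ranges such as (0, 10) and (5, 20) are classified as intersecting, but no pair of prefix ranges can produce that result.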

Lemma 1: Any two different prefixes A = [b, e] and B = [u, v] in a routing table are either enclosure or disjoint (i.e., they cannot intersect).

Proof: When the prefix lengths of A and B are equal, A and B differ in at least one specified bit, so the addresses they match are different and their ranges are disjoint. When the prefix lengths differ, assume without loss of generality that A is longer than B. If B is not a prefix of A (i.e., A and B differ in one of the bits specified by B), then the ranges covered by A and B are disjoint. Otherwise, B is a prefix of A and we have u ≤ b ≤ e ≤ v, so A and B are enclosure.

4.2 Most Specific Prefix Tree – MSPT

4.2.1 Most Specific Prefix

Definition 3: A prefix is called a most specific prefix if it does not enclose any other

prefixes in the routing table. Otherwise, it is called a non-most specific prefix.

Consider the following set of prefixes: P1 (1*), P2 (0101*), P3 (100*), P4 (1001*), and P5 (10111). P2 does not overlap with any other prefix, P4 has the longest prefix length in the overlapping prefix set (P1, P3, P4), and P5 has the longest prefix length in the overlapping prefix set (P1, P5). Therefore, P2, P4, and P5 are the most specific prefixes, and P1 and P3 are non-most specific prefixes. Figure 4.1 shows the relationships among these prefixes. To illustrate the definition further, if we use a binary trie to represent all the prefixes in a routing table, all external nodes of the binary trie are most specific prefixes, and the remaining (non-most specific) prefixes are internal nodes. Figure 4.2 shows the binary trie for the above example.
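Definition 3 translates directly into a small Python check: a prefix encloses another prefix exactly when its bit string is a proper leading part of the other's bit string. The quadratic scan below is only a sketch for small examples, and the function name is illustrative.

```python
def most_specific(prefixes):
    """Return the names of the prefixes that enclose no other prefix.

    prefixes maps a name to its bit string, e.g. {'P3': '100'}."""
    result = []
    for name, bits in prefixes.items():
        encloses_other = any(other != bits and other.startswith(bits)
                             for other in prefixes.values())
        if not encloses_other:
            result.append(name)
    return result


EXAMPLE = {"P1": "1", "P2": "0101", "P3": "100", "P4": "1001", "P5": "10111"}
print(most_specific(EXAMPLE))   # ['P2', 'P4', 'P5']
```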

Figure 4.1 The relationships for a set of prefixes.

Prefix name   Prefix    Start Address   Finish Address
P1            1*        16              31
P2            0101*     10              11
P3            100*      16              19
P4            1001*     18              19
P5            10111     23              23

• P2 does not overlap with any other prefixes.
• P4 and P5 have the longest prefix length in the overlapping prefix sets (P1, P3, P4) and (P1, P5).
• P2, P4 and P5 are the most specific prefixes; P1 and P3 are non-most specific prefixes.

Figure 4.2 A binary trie for a set of prefixes.

• P2, P4 and P5 are the most specific prefixes and they are the external nodes of this binary trie.
• P1 and P3 are non-most specific prefixes and they are the internal nodes of this binary trie.

Lemma 2: Let R be the set of all most specific prefixes in a routing table. Any two prefixes a, b ∈ R must be disjoint (a ∩ b = Ø).

Proof: By Definition 3, neither a nor b encloses any other prefix, so a and b do not enclose each other. By Lemma 1, two prefixes that are not enclosure must be disjoint. Thus, the lemma follows.

MSPT is a balanced binary search tree. Each node in MSPT represents a most specific prefix, and by Lemma 2 all of these prefixes are disjoint. Moreover, MSPT has an additional enclosure constraint for placing the non-most specific prefixes, defined as follows.

Definition 4 (MSPT Enclosure Constraint): Each non-most specific prefix p is allocated to the enclosure set of the most specific prefix node x that is nearest to the root node of MSPT among the nodes whose prefixes are enclosed by p.

4.2.2 Data Structure for MSPT

We classify all prefixes in a routing table into two types: most specific prefixes and non-most specific prefixes. Since all the most specific prefixes in a routing table are disjoint, the prefix comparison rule of Definition 2(d) can be used to construct a balanced binary search tree, called the Most Specific Prefix Tree (MSPT). In MSPT, each node x represents a most specific prefix, prefix(x), and stores a key value, key(x), a length value, length(x), a port value, port(x), and an enclosure set, enclosure(x). The key value (we use the start address of the most specific prefix) and the length value together represent the most specific prefix, and the port value is its output interface. All key values in the left subtree of node x are smaller than key(x), and those in the right subtree are larger than key(x).

Let R be the set of all non-most specific prefixes in a routing table. The enclosure set enclosure(x) of a node x in MSPT stores those prefixes a ∈ R that enclose prefix(x). All prefixes a ∈ R that are disjoint from prefix(x) and satisfy a < prefix(x) are stored in the left subtree of node x, and the remaining prefixes of R are stored in the right subtree of node x. This allocation rule is applied recursively to the left and right subtrees of the MSPT. Every non-most specific prefix stored in enclosure(x) is a prefix of prefix(x), so it can be represented by its prefix length together with the key value stored in node x. Therefore, we only need to store the prefix length and port number of each non-most specific prefix in the enclosure set. Moreover, since all the prefixes stored in an enclosure set enclose one another, they have different lengths. The enclosure set can thus itself be organized as a balanced binary search tree keyed on prefix length, in which each node represents a non-most specific prefix.

Figure 4.3 shows an MSPT example for the prefix set in Table 4.1. Each node in Figure 4.3 represents a most specific prefix. The enclosure set of node b is shown in Figure 4.4. Each node in Figure 4.4 represents a non-most specific prefix.

Table 4.1 A sample prefix set.

Prefix        Port
P1 0*         A
P2 0101*      B
P3 100*       C
P4 1001*      D
P5 10111      E
P6 11*        F
P7 0001*      G
P8 01*        H
P9 00111      I
P10 001*      J
P11 0011*     K

P2, P4, P5, P6, P7, P9 are the most specific prefixes. P1, P3, P8, P10, P11 are non-most specific prefixes.

Figure 4.3 A most specific prefix tree (MSPT) example for Table 4.1.

MSPT nodes recoverable from the figure: a = P4 (1001*) is the root with enclosure(a) = {P3}; b = P9 (00111) with enclosure(b) = {P1, P10, P11}; d = P7 (0001*); e = P2 (0101*) with enclosure(e) = {P8}; nodes c and f hold P5 (10111) and P6 (11*).

Figure 4.4 The enclosure set of node b in Figure 4.3.

Enclosure set of node b: g = P10 is the root, with children h = P1 and i = P11.

4.3 Finding the Longest Prefix Match

The longest prefix that matches the destination address d is found along a path from the root node toward a leaf node of the MSPT. Figure 4.5 gives the algorithm. In function LPM(d), when we encounter a node x of the MSPT, we first check whether the destination address d is enclosed by prefix(x), the most specific prefix represented by node x. If prefix(x) matches the address d, the search procedure stops. Otherwise, the search executes function enclosure(x).search(d, key(x), port) to check whether any non-most specific prefix in enclosure(x) matches the address d, and then continues toward a leaf node of the MSPT.

Since the non-most specific prefixes stored in an enclosure set are organized as a balanced binary search tree, the algorithm of function enclosure(x).search(d, key(x), port), shown in Figure 4.6, is similar to that of LPM(d). Consider the non-most specific prefixes stored in enclosure(x). If the non-most specific prefix represented by a node y matches the address d, the search continues to the right subtree of node y to see whether another non-most specific prefix with a longer prefix length also matches d. Otherwise, it goes to the left subtree of node y to see whether a non-most specific prefix with a shorter prefix length matches d.

For example, suppose we want to find the LPM for the address Dst = 00110 in Table 4.1. First, in Figure 4.3, the most specific prefix P4 represented by the root node a of the MSPT, prefix(a), does not match Dst (prefix(a) does not enclose Dst). We then check the enclosure set of node a, enclosure(a), to see whether any non-most specific prefix in it (P3) encloses Dst; none does. Since Dst = 00110 < key(a) = 10010, the search continues to node b of the MSPT. The most specific prefix P9 represented by node b, prefix(b), does not match Dst either, so we check the enclosure set of node b, enclosure(b). In enclosure(b), we first find that the non-most specific prefix P10 represented by the root node g matches Dst. Prefix P10 is stored temporarily and the search continues to node i. The prefix P11 represented by node i also matches Dst. Since P11 has a longer prefix length than P10, P11 replaces P10 as the stored prefix. The search then moves to the leaf node d of the MSPT and ends there, because prefix(d) does not match Dst and the enclosure set of node d is empty. Therefore, the prefix P11 is the LPM for Dst.

The time required to find the LPM can be determined by analyzing the functions LPM(d)
and enclosure(x).search(d, key(x), port). The time complexity of enclosure(x).search(d,
key(x), port) is readily seen to be O(height(enclosure(x))) = O(log(maximum number of
non-most specific prefixes stored in enclosure(x))). Since all the non-most specific
prefixes in enclosure(x) enclose prefix(x), they are nested and therefore have distinct
lengths. Thus, the number of non-most specific prefixes stored in enclosure(x) is at most
the address length W, which is 32 for IPv4 and 128 for IPv6. LPM(d) visits O(log N) nodes
of the MSPT and performs one enclosure search at each of them. Consequently, the longest
prefix match can be found in O((log N)(log W)) time, where N is the number of prefixes in
the routing table and W is the address length.
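The bound can be written out as a short derivation. This is only a restatement of the argument above, assuming a balanced MSPT of height O(log N) and writing path(d) for the (hypothetical) name of the set of MSPT nodes visited while searching for d:

```latex
\begin{aligned}
T_{\mathrm{LPM}}(d)
  &= \sum_{x \in \mathrm{path}(d)} O\bigl(1 + \log|\mathrm{enclosure}(x)|\bigr)
     && \text{one enclosure search per visited node} \\
  &\le |\mathrm{path}(d)| \cdot O(\log W)
     && \text{since } |\mathrm{enclosure}(x)| \le W \\
  &= O(\log N) \cdot O(\log W)
   = O\bigl((\log N)(\log W)\bigr).
\end{aligned}
```

For IPv4 the per-node factor is at most log2(32) = 5 comparisons, and for IPv6 at most log2(128) = 7.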

Figure 4.5 Algorithm to find longest prefix match.

Algorithm LPM(d, root) {
    /* prefix(x) is the most specific prefix represented by node x in MSPT.
       key(x) is the key value (usually the start point of the most specific prefix).
       length(x) is the length of the most specific prefix represented by node x.
       port(x) is the port information of the most specific prefix represented by node x.
       enclosure(x) is the enclosure set of node x in MSPT. */
    x = root;    /* root node of the MSPT */
    port = Default;
    while ( x ≠ null ) {
        /* first check whether the most specific prefix represented by node x
           encloses the address d */
        if ( d ⊆ prefix(x) )
            return port(x);
        port = enclosure(x).search( d, key(x), port );
        if ( d > key(x) )
            x = RightChild(x);
        else
            x = LeftChild(x);
    }
    return port;
}

Figure 4.6 Algorithm of enclosure(x).search(d, key(x), port).

Algorithm enclosure(x).search( d, key(x), port ) {
    /* length(y) is the length of the non-most specific prefix represented by node y in
       enclosure(x). port(y) is the port information of the non-most specific prefix
       represented by node y in enclosure(x). */
    y = root;    /* root node of enclosure set */
    temp_port = port;
    while ( y ≠ null ) {
        /* >> is a right shift operator */
        if ( (key(x) >> (32-length(y))) == (d >> (32-length(y))) ) {
            temp_port = port(y);
            y = RightChild(y);
        } else
            y = LeftChild(y);
    }
    return temp_port;
}
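The two algorithms of Figures 4.5 and 4.6 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: it assumes a 5-bit address width to match the toy examples, stores each enclosure set as a length-sorted list with a binary search standing in for the balanced tree, and uses hypothetical names (Node, parse, matches, enclosure_search, lpm). The sample tree is the one shown in Figure 4.11.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

W = 5  # toy address width in bits; IPv4 would use 32


def parse(bits: str) -> Tuple[int, int]:
    """Turn a pattern such as '00*' into (start address, prefix length)."""
    body = bits.rstrip("*")
    key = int(body, 2) << (W - len(body)) if body else 0
    return key, len(body)


def matches(key: int, length: int, d: int) -> bool:
    """True if the prefix made of the first `length` bits of key encloses d."""
    return (key >> (W - length)) == (d >> (W - length))


@dataclass
class Node:
    key: int      # start address of the most specific prefix
    length: int
    port: str
    # (length, port) pairs in ascending length; each one encloses prefix(x),
    # so its bits are the leading bits of key
    enclosure: List[Tuple[int, str]] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def enclosure_search(x: Node, d: int, port: Optional[str]) -> Optional[str]:
    """Binary search over nested prefixes: after a match, try longer ones."""
    lo, hi = 0, len(x.enclosure) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        length, p = x.enclosure[mid]
        if matches(x.key, length, d):
            port, lo = p, mid + 1
        else:
            hi = mid - 1
    return port


def lpm(root: Optional[Node], d: int, default: Optional[str] = None) -> Optional[str]:
    x, port = root, default
    while x is not None:
        if matches(x.key, x.length, d):      # most specific prefix wins at once
            return x.port
        port = enclosure_search(x, d, port)
        x = x.right if d > x.key else x.left
    return port


# Tree of Figure 4.11: b is the root, a and c are its children.
a = Node(*parse("0001*"), port="P1", enclosure=[(3, "P5")])
c = Node(*parse("100*"), port="P6")
root = Node(*parse("0010*"), port="P2",
            enclosure=[(1, "P4"), (2, "P3")], left=a, right=c)
```

With this tree, looking up Dst = 00111 via lpm(root, 0b00111) yields P3, which is the answer the text expects for the corrected MSPT.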

4.4 Update

4.4.1 Inserting a Prefix

Figure 4.7 shows the algorithm to insert a prefix P. In the while loop, we find the node
x nearest to the root of MSPT such that prefix P encloses prefix(x) or prefix P is enclosed
by prefix(x). If such a node x exists, we act according to the relation between prefix(x)
and P. If prefix P encloses prefix(x), we insert P into enclosure(x) (i.e., prefix P is now
a non-most specific prefix). Otherwise, if prefix P is enclosed by prefix(x), prefix(x) must
be inserted into enclosure(x) and prefix P replaces prefix(x) (i.e., prefix(x) becomes a
non-most specific prefix and prefix P becomes a most specific prefix).

If there is no node x such that prefix P encloses prefix(x) or is enclosed by prefix(x),
we must insert a new node y into the MSPT, define its key value key(y), port value port(y)
and length value length(y), and initialize its enclosure set enclosure(y). This node y
represents the prefix P in MSPT (i.e., prefix P is now a most specific prefix). A new node
is inserted into the MSPT by the function Insert_node(P), which uses the node insertion
algorithm of the balanced binary search tree. However, inserting a new node may cause the
balanced binary search tree to become unbalanced. The rebalancing procedure then requires
at least one rotation and may violate the MSPT enclosure constraint. We discuss the
rebalancing rotation problem in Section 4.4.3.

Excluding the time required for the tasks associated with a rebalancing rotation, the
time required to insert a prefix is O(height(MSPT) + height(enclosure())) =
O(log N + log W).

Figure 4.7 Algorithm to insert a prefix.

Algorithm insert(P, root) {
    /* insert a new prefix P into the MSPT rooted at root */
    /* prefix(x) is the most specific prefix represented by node x in MSPT */
    x = root;    /* root node of MSPT */
    while ( x ≠ null ) {
        if ( P ⊆ prefix(x) ) {    /* prefix(x) encloses prefix P */
            if ( P = prefix(x) ) return;
            /* insert prefix(x) into enclosure(x), then use prefix P to replace prefix(x) */
            insert prefix(x) into enclosure(x);
            prefix(x) = P;
            return;
        } else if ( prefix(x) ⊆ P ) {    /* prefix P encloses prefix(x) */
            insert P into enclosure(x);
            return;
        }
        /* if prefix P neither encloses prefix(x) nor is enclosed by prefix(x),
           prefix P must be disjoint from prefix(x) */
        if ( P > prefix(x) )    /* the start address of P is larger than the key, key(x) */
            x = RightChild(x);
        else                    /* the start address of P is smaller than the key, key(x) */
            x = LeftChild(x);
    }
    /* P is disjoint from all the most specific prefixes in MSPT:
       create a new node and insert it into MSPT */
    Insert_node(P);
}

4.4.2 Deleting a Prefix

Deleting a prefix is more complex than inserting one. Figure 4.8 gives our algorithm to
delete a prefix P. First, we have to determine whether P is a non-most specific prefix
stored in an enclosure set or a most specific prefix represented by a node x in MSPT. In
the former case, we simply remove P from the enclosure set. Otherwise, if P is a most
specific prefix represented by an internal node x in MSPT, we have to perform a
delete_internal_prefix_node operation to maintain the MSPT enclosure constraint. Figure 4.9
gives the steps of delete_internal_prefix_node.

Notice that a most specific prefix to be deleted may be represented by either a leaf node
or an internal node x of the MSPT. If node x is a leaf node and enclosure(x) is empty,
node x is deleted from the MSPT and, if the MSPT becomes unbalanced, a rotation is done as
described in Section 4.4.3. Otherwise, if node x is a leaf node and enclosure(x) is not
empty, we use the longest non-most specific prefix stored in enclosure(x) to replace
prefix(x) and delete that prefix from enclosure(x).

When node x is an internal node, the delete_internal_prefix_node function is needed to
maintain the MSPT enclosure constraint. If the degree of node x is 1, we insert all
non-most specific prefixes stored in enclosure(x) into the subtree of node x and then
follow the binary tree node deletion algorithm to delete node x. If the degree of node x
is 2, let y be the node with the largest key in the left subtree of node x or the node with
the smallest key in the right subtree of node x, let temp be temporary storage, and let P1
to Ps be the nodes on the path from y to x, so that P1 = y and Ps = x. First, store all
prefixes of enclosure(x) in temp; node y then replaces node x when the balanced binary
search tree deletion algorithm is performed. Second, insert all prefixes stored in temp at
node y. Third, delete the prefixes that enclose prefix(y) from enclosure(Pi), for i = 1 to
s, and add those prefixes to enclosure(y) to satisfy the MSPT enclosure constraint.

Excluding the time required for rotations, the complexity of deleting a prefix P from a
prefix set is the O(log N) time needed to find the node x such that prefix(x) is equal to
P or is enclosed by P, plus O(W log N) time to perform the delete_internal_prefix_node
function. Consequently, a deletion may take O(W log N) time.

Figure 4.8 Algorithm to delete a prefix.

Algorithm delete(P, root) {
    /* delete a prefix P */
    x = root;    /* root of MSPT */
    while ( x ≠ null ) {
        if ( P = prefix(x) ) {    /* prefix(x) equals prefix P */
            if ( x is a leaf node ) {
                if ( enclosure(x) is not empty ) {
                    delete the prefix A which is the longest prefix in enclosure(x);
                    prefix(x) = A;
                } else
                    Delete_node(x);    /* enclosure(x) is empty */
                return;
            } else {    /* x is an internal node */
                delete_internal_prefix_node(x);
                return;
            }
        } else if ( prefix(x) ⊆ P ) {    /* prefix P encloses prefix(x) */
            if ( P exists in enclosure(x) )
                delete P from enclosure(x);
            else
                report that prefix P does not exist in MSPT;
            return;
        }
        /* if prefix P neither encloses prefix(x) nor is enclosed by prefix(x),
           prefix P must be disjoint from prefix(x) */
        if ( P > prefix(x) )    /* prefix P is larger than prefix(x) */
            x = RightChild(x);
        else                    /* prefix P is smaller than prefix(x) */
            x = LeftChild(x);
    }
    report that prefix P does not exist in MSPT;    /* the search left the tree */
}

Figure 4.9 Algorithm to maintain MSPT enclosure constraint following a delete.

Algorithm delete_internal_prefix_node(x) {
    /* delete the prefix of an internal node */
    if ( the degree of node x is 1 ) {
        if ( enclosure(x) is empty )
            Delete_node(x);
        else {    /* enclosure(x) is not empty */
            /* let y be the only child of x */
            for each prefix q in enclosure(x)
                insert(q, y);
            Delete_node(x);
        }
    } else {    /* the degree of node x is 2 */
        /* let y be the node with the largest key value in the left subtree of node x or
           the node with the smallest key value in the right subtree of node x.
           node y will replace node x.
           temp temporarily stores all prefixes of enclosure(x). */
        temp = enclosure(x);
        Delete_node(x);    /* replace x with y */
        for each prefix q in temp
            insert(q, y);
        /* let P1 to Ps be the nodes on the path from y to x, so P1 = y, Ps = x */
        for i = 1 to s
            delete the prefixes that enclose prefix(y) from enclosure(Pi), and add
            those prefixes to enclosure(y) to satisfy the MSPT enclosure constraint;
    }
    return;
}

4.4.3 Rotations Problem

MSPT is a balanced binary search tree. When a node is inserted into or deleted from the
MSPT and the MSPT becomes unbalanced, at least one rotation is required to rebalance it.
Those rotations may violate the non-most specific prefix allocation rule and cause the
search operation to fail. For example, Figure 4.10 shows an MSPT that is rebalanced after
inserting a new prefix P6. However, the rebalanced MSPT (Figure 4.10(b)) conflicts with
the non-most specific prefix allocation rule. As described in Section 4.2.2, all non-most
specific prefixes allocated to the left subtree of a most specific prefix node x in MSPT
must be disjoint from prefix(x) and smaller than prefix(x), and those allocated to the
right subtree of node x must be disjoint from prefix(x) and larger than prefix(x). But in
Figure 4.10(b), the non-most specific prefixes P3 and P4 stored in enclosure(a) are not
disjoint from prefix(b); they enclose prefix(b). In this case, a search for the LPM of
address Dst = 00111 proceeds as follows. In Figure 4.10(b), at the root node b, prefix(b)
does not match Dst and enclosure(b) is empty. Comparing key(b) with Dst, since key(b) is
smaller than Dst, the search continues to node c. At node c, prefix(c) also does not match
Dst and enclosure(c) is empty. Thus, no LPM is found for address Dst. This result is
incorrect, since the LPM should be P3 = 00*. Therefore, to ensure that the search
operation is performed correctly, we have to remove the two non-most specific prefixes P3
and P4 from enclosure(a) and insert them into enclosure(b). Figure 4.11 shows the correct
MSPT after this adjustment.

Figure 4.10 (a) An unbalanced MSPT after inserting a new prefix P6. (b) A rebalanced MSPT after performing a rebalancing rotation.
[Figure content: nodes a = (P1, 0001*), b = (P2, 0010*) and c = (P6, 100*). In both (a) and (b), enclosure(a) = {(P3, 00*), (P4, 0*), (P5, 000*)}; in (b), after the rotation, node b is the root.]

Figure 4.11 Correct MSPT after adjusting.
[Figure content: node b = (P2, 0010*) is the root with children a = (P1, 0001*) and c = (P6, 100*); enclosure(a) = {(P5, 000*)} and enclosure(b) = {(P3, 00*), (P4, 0*)}.]

A balanced binary search tree can be implemented as a red-black tree or an AVL tree. The
LL and RR rotations used to rebalance a red-black tree following an insert or delete
operation are shown in Figure 4.12. An AVL tree, besides LL and RR rotations, also uses LR
and RL rotations. We may view an LR rotation as an RR rotation followed by an LL rotation,
and an RL rotation as an LL rotation followed by an RR rotation. Figure 4.13 shows these
two rotation types.

In Figure 4.12, we observe that the relative position of node a and node b changes after
an LL or RR rotation: node a becomes a node in the subtree of node b. To avoid violating
the non-most specific prefix allocation rule, we have to find a set S such that, after the
rebalancing rotation, enclosure(b) = enclosure(b) ∪ S and enclosure(a) = enclosure(a) - S,
where S = { p | p ∈ enclosure(a) and p encloses prefix(b) } (i.e., the non-most specific
prefixes that are stored in enclosure(a) and enclose prefix(b) are deleted from
enclosure(a) and inserted into enclosure(b)). The time required to perform an LL or RR
rotation depends on the time to determine the set S, remove S from enclosure(a), and
insert S into enclosure(b). For LR and RL rotations, the time is roughly twice that for LL
and RR rotations.

All prefixes stored in an enclosure set are nested. Therefore, to find the set S, it is
enough to find pMax, the longest prefix in enclosure(a) that encloses prefix(b); every
shorter prefix in enclosure(a) then also encloses prefix(b). Moreover, since each
enclosure set is a balanced binary search tree ordered by prefix length, pMax can be found
in O(height(enclosure(a))) time by following a path from the root to a leaf node. If pMax
exists, we use the split operation [7] to extract the prefixes that belong to S from
enclosure(a). We first separate enclosure(a) into a red-black tree pSmall, holding all
prefixes shorter than pMax, and a red-black tree pBig, holding all prefixes longer than
pMax. We then use the join operation [7] to combine pSmall, the prefix pMax, and the
red-black tree enclosure(b) into a single red-black tree. Thus, after the rebalancing
rotation, enclosure(a) = pBig and enclosure(b) = join(pSmall, pMax, enclosure(b)).
Although the split and join operations of [7] need to be modified slightly, the
modification does not affect their complexity. So the cost of maintaining the enclosure
sets for an LL or RR rotation (and likewise for an LR or RL rotation) is O(log W).

A rebalancing rotation can be done in O(log W) time, and Sections 4.4.1 and 4.4.2 showed
that, without counting rebalancing rotations, the time required to insert and delete a
prefix is O(log N + log W) and O(W log N), respectively. So the overall insert and delete
times are still O(log N + log W) and O(W log N + log W).
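The enclosure fix-up that accompanies a rotation can be sketched as follows. This is a simplified illustration, not the split and join of [7]: each enclosure set is a length-sorted Python list, so S is cut off with list slicing; addresses are 5 bits wide; and the names (Node, move_enclosing_prefixes, rotate_left) are hypothetical. The starting tree assumes the right-leaning chain a, b, c for Figure 4.10(a).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

W = 5  # toy address width in bits


@dataclass
class Node:
    key: int       # start address of the most specific prefix
    length: int
    # (length, name) pairs in ascending length; each entry encloses prefix(x)
    enclosure: List[Tuple[int, str]] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def move_enclosing_prefixes(a: Node, b: Node) -> None:
    """Move S = {p in enclosure(a) : p encloses prefix(b)} into enclosure(b).

    The entries of enclosure(a) are nested, so the members of S form a
    leading run of the length-sorted list, ending at pMax.
    """
    cut = 0
    for length, _ in a.enclosure:
        same_bits = (a.key >> (W - length)) == (b.key >> (W - length))
        if length <= b.length and same_bits:
            cut += 1
        else:
            break
    s, a.enclosure = a.enclosure[:cut], a.enclosure[cut:]   # split at pMax
    # Entries already in enclosure(b) enclose prefix(b) but not prefix(a),
    # so they are longer than every member of S: concatenation keeps the order.
    b.enclosure = s + b.enclosure                           # join


def rotate_left(a: Node) -> Node:
    """Single rotation for a right-heavy subtree: the right child b moves above a."""
    b = a.right
    a.right, b.left = b.left, a
    move_enclosing_prefixes(a, b)
    return b


# Assumed shape of Figure 4.10(a): a -> b -> c along right children.
c = Node(key=0b10000, length=3)
b = Node(key=0b00100, length=4, right=c)
a = Node(key=0b00010, length=4,
         enclosure=[(1, "P4"), (2, "P3"), (3, "P5")], right=b)
root = rotate_left(a)
```

After the rotation, enclosure(a) keeps only P5 and enclosure(b) holds P4 and P3, which is the arrangement described for Figure 4.11.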

Figure 4.12 LL and RR rotations. (a): LL rotation. (b): RR rotation.
[Figure content: nodes a and b with subtrees al, ar, bl and br, shown before and after each rotation.]

Figure 4.13 LR and RL rotations.

4.5 Enhance Our IP Address Lookup Scheme

By analyzing several real routing tables obtained from [1], [14], we observe that about
91% to 93% of the prefixes in a routing table are most specific prefixes and the remainder
are non-most specific prefixes. In our scheme, all of the most specific prefixes in a
routing table are used to construct a balanced binary search tree, and the remaining
(non-most specific) prefixes are allocated to the enclosure sets of the most specific
prefix nodes in MSPT. Table 4.2 further shows that about 91% to 93% of the enclosure sets
are empty. Excluding the empty enclosure sets, the average size of the nonempty enclosure
sets (the average number of non-most specific prefixes in a nonempty enclosure set) is
small, between 1.07 and 1.10, and the maximum size of a nonempty enclosure set is 4 or 5.
Thus, for a real routing table, the enclosure set of each most specific prefix node in
MSPT is either empty or stores only a few non-most specific prefixes.

Table 4.2 Data structure analysis for MSPT.

Database                                  AS6447    AS6447    AS6447    AS7660    AS2493
(year-month)                              (2000-4)  (2002-4)  (2005-4)  (2005-4)  (2005-4)
# of prefixes                             79560     124824    163574    159816    157154
# of repeated prefixes                    25        21        32        120       133
# of most specific prefixes               73900     114745    150245    145849    143684
# of non-most specific prefixes           5635      10058     13297     13847     13337
# of empty enclosure sets                 68639     105561    138114    133297    131520
# of nonempty enclosure sets              5261      9184      12131     12552     12164
Max size of nonempty enclosure sets       4         5         4         4         4
Average size of nonempty enclosure sets   1.07      1.09      1.09      1.10      1.10

In our implementation, we use the key value and length value stored in each node of the
MSPT to represent a most specific prefix. To use memory more efficiently, the prefix
representation method and comparison rule described in [3] can be adopted. Based on [3], a
prefix can be represented by a key value alone, without saving any length information in
each node of the MSPT. Moreover, given our statistical analysis of several routing tables
shown in Table 4.2 and the fact that a prefix is enclosed by at most six prefixes in a
real routing table [12], [13], we may expect better performance from a simpler structure
for an enclosure set than the balanced binary search tree described in Section 4.2.2. We
therefore replace the balanced binary search tree in each enclosure set with an array of
pairs of the form (prefix length, port). The pairs in this array are kept in ascending
order of prefix length.

We have described that inserting or deleting a prefix may cause the rotation problem and
require some non-most specific prefixes stored in the enclosure sets to be redistributed.
The time required to do a rotation and redistribute non-most specific prefixes depends on
the data structure used to represent the enclosure set and on the number of non-most
specific prefixes stored in the enclosure set. As noted earlier, a prefix is enclosed by at
most six prefixes in practice. So, in practice, MSPT takes O(log N) time per operation
(search, insertion, deletion).
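The array form of an enclosure set can be sketched as below. It is a minimal sketch assuming 32-bit IPv4 addresses and integer port numbers; the class name EnclosureArray and its methods are hypothetical. Because the array holds only a handful of entries in practice, the search simply scans from the longest entry downward.

```python
import bisect
from typing import List, Optional, Tuple

W = 32  # IPv4 address width in bits


class EnclosureArray:
    """Enclosure set of one MSPT node as (prefix length, port) pairs."""

    def __init__(self, key: int) -> None:
        self.key = key                       # key(x): start address of prefix(x)
        self.pairs: List[Tuple[int, int]] = []   # ascending prefix length

    def _find(self, length: int) -> int:
        # (length,) sorts just before any (length, port) pair
        return bisect.bisect_left(self.pairs, (length,))

    def insert(self, length: int, port: int) -> None:
        i = self._find(length)
        if i < len(self.pairs) and self.pairs[i][0] == length:
            self.pairs[i] = (length, port)   # same prefix: update the port
        else:
            self.pairs.insert(i, (length, port))

    def delete(self, length: int) -> bool:
        i = self._find(length)
        if i < len(self.pairs) and self.pairs[i][0] == length:
            del self.pairs[i]
            return True
        return False

    def search(self, d: int, port: Optional[int] = None) -> Optional[int]:
        """Return the port of the longest entry enclosing d, else `port`."""
        for length, p in reversed(self.pairs):
            shift = W - length
            if (self.key >> shift) == (d >> shift):
                return p
        return port


# Node for 192.168.1.0/24 with enclosing prefixes 192.0.0.0/8 and 192.168.0.0/16.
enc = EnclosureArray(key=0xC0A80100)
enc.insert(16, 2)
enc.insert(8, 1)
```

A lookup of 192.168.2.5 (0xC0A80205) in this set returns port 2 from the /16 entry, while 192.1.2.3 (0xC0010203) falls back to port 1 from the /8 entry.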

4.6 Migrate to IPv6

In this section, we sketch an extension of our scheme to IPv6. While IPv4 uses 32-bit IP
addresses, the next generation, IPv6, uses 128-bit addresses. On a 32-bit machine, four
words (one word is 32 bits) are needed to represent an IPv6 address, so accessing a single
128-bit address requires four memory accesses and the performance may suffer. Moreover,
IPv6 addressing is hierarchical, and how to exploit this characteristic to make routing
more efficient is an important problem.

Since four words are needed to represent an IPv6 prefix and IPv6 addressing is
hierarchical, we divide each IPv6 address into four parts of 32 bits (one word) each. In
other words, we build a four-level MSPT structure in which the MSPT of each level is built
on one 32-bit word. In this four-level structure, the MSPT of each level represents only
one segment of the IPv6 addresses, which reduces the tree height and improves performance.
The construction procedure is as follows. We first set W = 32 and build a level-one MSPT
L1 on the first words of the prefixes in a prefix database S. The prefixes of S fall into
two categories: (1) those with prefix length less than or equal to W are treated as normal
prefixes; (2) those with prefix length greater than W are treated at this level as
prefixes of length 32 and are then inserted into the next level. The procedures for
constructing level two, level three and level four are similar to that for level one, but
the value of W is changed to 64, 96 and 128, respectively. Figure 4.14 gives an example
with seven IPv6 prefixes. Since the maximum prefix length among these seven IPv6 prefixes
is 48, the level-three and level-four MSPTs are not built in this example.
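The way a prefix is cut into per-level pieces can be sketched as follows. This is a small illustration using Python's standard ipaddress module; the function name split_ipv6_prefix is hypothetical, and only the splitting step is shown, not the construction of the MSPT at each level.

```python
import ipaddress
from typing import List, Tuple


def split_ipv6_prefix(text: str) -> List[Tuple[int, int]]:
    """Return one (32-bit word, prefix length within that word) pair per level."""
    net = ipaddress.IPv6Network(text, strict=False)
    addr, remaining = int(net.network_address), net.prefixlen
    pieces: List[Tuple[int, int]] = []
    for level in range(4):
        word = (addr >> (96 - 32 * level)) & 0xFFFFFFFF
        if remaining <= 32:
            pieces.append((word, remaining))   # normal prefix at this level
            break
        pieces.append((word, 32))              # full word: continue in next level
        remaining -= 32
    return pieces


p1 = split_ipv6_prefix("2001:0200:0000::/40")
p5 = split_ipv6_prefix("3FFE:0608:0002::/48")
p7 = split_ipv6_prefix("3FFE:1100::/24")
```

For the prefixes of Figure 4.14, P1 yields the level-one piece (2001:0200, 32) and the level-two piece (0000:0000, 8), while P7 stays entirely in level one as (3FFE:1100, 24).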

Figure 4.14 An MSPT example for IPv6.

        prefix                  port
P1      2001:0200:0000::/40     A
P2      2001:0200:0500::/40     B
P3      3FFE:0100:0600::/40     C
P4      3FFE:0100::/24          D
P5      3FFE:0608:0002::/48     E
P6      3FFE:0608:1002::/48     F
P7      3FFE:1100::/24          G

[Figure content: the level-one MSPT has nodes a to d holding (2001:0200, 32), (3FFE:0100, 32), (3FFE:0608, 32) and P7 (3FFE:1100, 24, G), with P4 stored in enclosure(a). The level-two MSPTs hold P1 (0000:0000, 8, A), P2 (0500:0000, 8, B), P3 (0600:0000, 8, C), P5 (0002:0000, 16, E) and P6 (1002:0000, 16, F).]

Chapter 5 Performance

In this chapter, we first introduce the simulation environment and then present the
simulation results in two parts: one for IPv4 and the other for IPv6. In each part, we
compare the different IP address lookup schemes in terms of memory requirement, lookup
speed and update time.

5.1 Simulation Environment

All tested schemes are implemented in C and run on a 2.4 GHz Pentium IV processor with an
8 KB L1 cache, a 256 KB L2 cache and 768 MB of main memory, running Redhat 9.0. The
gcc-3.2.2 compiler with optimization level -O4 is used. In addition, a special instruction
called RDTSC (read time stamp counter) is used to measure the performance of search and
update. The time-stamp counter keeps an accurate count of every clock cycle that occurs on
the processor. Thus, times can be measured in units of clock cycles by using the RDTSC
instruction.

5.2 Simulation Result for IPv4

Our experiments are conducted on five BGP routing tables of different sizes obtained from
[1], [14]. These BGP routing tables reflect the realistic sizes of the routing tables in
backbone routers currently deployed on the Internet. The details are shown in Table 5.1.

Table 5.1 BGP routing tables.

Database (AS number)   AS6447   AS6447   AS6447   AS7660   AS2493
year-month             2000-4   2002-4   2005-4   2005-4   2005-4
# of prefixes          79560    124824   163574   159816   157154

We experiment with the following non-precomputation-based schemes (dynamic routing table
structures): PBOB (prefix binary tree on binary tree structure [12]), MRT (multiway range
tree structure [21]), BTRIE (binary trie) and our proposed scheme MSPT (most specific
prefix tree). We also include several precomputation-based schemes in our performance
measurements for comparison with MSPT: BRS (binary range search [11]), BPS (binary prefix
search [3]), HCA (Huang's compact algorithm [8]), BLS (binary length search [18]) and Lulea
(Lulea compressed trie [2]). The schemes whose names end with "16" (for example, PBOB-16)
are variants of the corresponding pure schemes that use the first 16 bits of the IP
address to index a segmentation table. Many IP address lookup schemes use this method to
further improve their lookup speed and update performance. The concept of the 16-bit
segmentation table is shown in Figure 5.1.
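The idea of indexing by the first 16 bits can be sketched as follows. This is an assumption-laden illustration rather than the tested implementation: the text does not say how prefixes shorter than 16 bits are handled, so the sketch assumes they are recorded as a per-entry default; each segment is a plain list scanned linearly where a real variant would hold an MSPT, PBOB or trie; and the names SegmentTable16, insert and lookup are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

W = 32  # IPv4 address width in bits
Prefix = Tuple[int, int, str]  # (start address, length, port)


class SegmentTable16:
    def __init__(self) -> None:
        # Sparse stand-ins for the 2**16-entry array.
        self.default: Dict[int, Tuple[int, str]] = {}   # index -> (length, port)
        self.segment: Dict[int, List[Prefix]] = {}      # index -> prefixes >= 16 bits

    def insert(self, key: int, length: int, port: str) -> None:
        first = key >> 16
        if length >= 16:
            self.segment.setdefault(first, []).append((key, length, port))
            return
        # Assumption: a short prefix covers 2**(16 - length) consecutive entries
        # and is kept there as the default when it is the longest short prefix.
        for idx in range(first, first + (1 << (16 - length))):
            if idx not in self.default or self.default[idx][0] < length:
                self.default[idx] = (length, port)

    def lookup(self, d: int) -> Optional[str]:
        idx = d >> 16                                   # first 16 bits select the entry
        best: Optional[Tuple[int, str]] = self.default.get(idx)
        for key, length, port in self.segment.get(idx, []):
            shift = W - length
            if (key >> shift) == (d >> shift) and (best is None or length > best[0]):
                best = (length, port)
        return best[1] if best else None


table = SegmentTable16()
table.insert(0x0A000000, 8, "A")     # 10.0.0.0/8
table.insert(0x0A010000, 16, "B")    # 10.1.0.0/16
table.insert(0x0A010200, 24, "C")    # 10.1.2.0/24
```

Each lookup touches one table entry and then only the prefixes of a single segment, which is why the per-segment structure sees far fewer prefixes than the full routing table.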

Figure 5.1 16-bit segmentation table.
[Figure content: a 16-bit segmentation table with 2^16 entries indexed 0, 1, 2, 3, ..., i, ..., 65535; entries point to segment-1, segment-3, segment-i and segment-65535.]

Total Memory Requirement. Table 5.2 shows the amount of memory used by each of the tested
schemes. These memory requirements are plotted as a histogram in Figure 5.2. Since
precomputation simplifies the entire search data structure, the precomputation-based
schemes indeed have smaller memory requirements. Among the non-precomputation-based
schemes, however, our MSPT requires the least memory. Comparing PBOB and MSPT, we find
that the MSPT structure uses about 30% less memory than the PBOB structure. This result
can be attributed to the fact that the number of nodes in MSPT is always less than or
equal to that in PBOB. Moreover, less than 1 percent of the range sets in the constructed
PBOB are empty, so PBOB needs additional memory to store the nonempty range sets (each
nonempty range set is an array with six entries). In contrast, almost all enclosure sets
in MSPT are empty. Hence, MSPT always requires less memory than PBOB. As for BTRIE and
MRT, both need more memory to store long sequences of one-child nodes and link pointers.
The detailed data structure analysis of the non-precomputation-based schemes is shown in
Table 5.3. The relative memory performance remains the same when a 16-bit segmentation
table is adopted to improve lookup speed.

Table 5.2 The statistics of memory requirement (in KB) for IPv4.

scheme      AS6447   AS6447   AS6447   AS7660   AS2493
            2000-4   2002-4   2005-4   2005-4   2005-4

Non Precomputation-Based (dynamic routing tables structure)
PBOB        2122     3299     4317     4192     4129
MSPT        1433     2237     2930     2853     2809
BTRIE       2789     4231     5383     5147     5056
MRT         3665     5689     7399     7220     7099
PBOB-16     2465     3550     4484     4356     4298
MSPT-16     1828     2533     3143     3068     3027
BTRIE-16    3167     4555     5655     5397     5308

Precomputation-Based (static routing tables structure)
BRS-16      1474     1972     2391     2336     2308
BPS-16      1006     1223     1399     1359     1345
HCA-16      1298     1876     2072     2019     2005
BLS         1459     2315     2817     3074     3292
Lulea       521      802      946      921      895

PBOB (prefix binary tree on the binary tree [12]), MSPT (most specific prefix tree),
BTRIE (binary trie), MRT (multiway range tree [21], 32 way), BRS (binary range search [11]),
BPS (binary prefix search [3]), HCA (Huang's compact algorithm [8]),
BLS (binary length search [18]), Lulea (Lulea compressed trie [2])


Figure 5.2 Total memory requirement (in KB) for IPv4.

Table 5.3 Data structure analysis for non precomputation-based schemes.

scheme                                 AS6447   AS6447   AS6447   AS7660   AS2493
                                       2000-4   2002-4   2005-4   2005-4   2005-4

PBOB      # of nodes                   75075    116722   152741   148343   146109
          # of empty range sets        319      562      707      785      727
MSPT      # of nodes                   73900    114745   150245   145849   143684
          # of empty enclosure sets    68639    105561   138114   133297   131520
BTRIE     # of nodes                   237976   361106   459351   439246   431516
MRT       # of nodes                   7303     11338    14743    14381    14140
PBOB-16   # of nodes                   69353    110620   146109   141267   139038
          # of empty range sets        309      548      701      730      691
MSPT-16   # of nodes                   68484    109029   144072   139252   137046
          # of empty enclosure sets    64698    101958   134549   129356   127510
BTRIE-16  # of nodes                   215664   334058   427935   405993   398345

Search Time. To measure the lookup speed, we conduct trace-driven simulations to obtain
the lookup times of the tested schemes. The simulated IP traffic is generated as follows.
We first use an array A to store the start addresses of all prefixes in a database and
then add one to each of these start addresses. A random permutation of A is generated, and
this permutation determines the order in which we search for the longest prefix match of
each address in A. The time required to determine all the LPMs is measured and averaged
over the number of addresses in A. The experiment is repeated 100 times, and the mean of
these average times is computed. These mean times are reported in Table 5.4 and plotted as
a histogram in Figure 5.3.
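The measurement procedure can be sketched as a small harness. This is an illustrative stand-in, not the thesis setup: it uses Python's time.perf_counter_ns instead of the RDTSC instruction, assumes a fresh permutation is drawn for every repetition (the text does not say whether the permutation is reused), and the function name mean_lookup_time_us and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, List, Sequence


def mean_lookup_time_us(starts: Sequence[int],
                        lookup: Callable[[int], object],
                        repeats: int = 100,
                        seed: int = 0) -> float:
    """Mean per-lookup time in microseconds over `repeats` shuffled passes."""
    rng = random.Random(seed)
    addresses: List[int] = [s + 1 for s in starts]   # array A: start address + 1
    averages = []
    for _ in range(repeats):
        rng.shuffle(addresses)                       # random permutation of A
        t0 = time.perf_counter_ns()
        for d in addresses:
            lookup(d)
        elapsed = time.perf_counter_ns() - t0
        averages.append(elapsed / len(addresses))    # average over |A|
    return sum(averages) / len(averages) / 1000.0    # ns -> microseconds


# Example with a dummy lookup that only records the queried addresses.
seen: List[int] = []
t_us = mean_lookup_time_us([0, 16, 32], seen.append, repeats=4)
```

Any lookup function taking an integer address can be passed in, so the same harness can be reused across schemes.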

Table 5.4 The statistics of search time (in microseconds) for IPv4.

scheme      AS6447   AS6447   AS6447   AS7660   AS2493
            2000-4   2002-4   2005-4   2005-4   2005-4

Non Precomputation-Based (dynamic routing tables structure)
PBOB        0.57     0.69     0.79     0.77     0.72
MSPT        0.47     0.60     0.68     0.68     0.66
BTRIE       0.63     0.70     0.75     0.74     0.73
MRT         0.42     0.47     0.52     0.51     0.52
PBOB-16     0.34     0.38     0.43     0.42     0.41
MSPT-16     0.25     0.32     0.36     0.34     0.34
BTRIE-16    0.48     0.54     0.58     0.56     0.55

Precomputation-Based (static routing tables structure)
BRS-16      0.22     0.28     0.32     0.31     0.31
BPS-16      0.15     0.18     0.21     0.19     0.19
HCA-16      0.15     0.18     0.18     0.20     0.20
BLS         0.23     0.35     0.40     0.42     0.41
Lulea       0.18     0.21     0.23     0.23     0.22


Figure 5.3 Search time (in Microsecond) for IPv4.

First, consider only the non-precomputation-based schemes. Without the 16-bit segmentation
table, the performance of PBOB becomes worse than that of the binary trie when the number
of entries in the BGP table becomes large, and MRT has the best performance among these
schemes. We attribute the search performance of MRT to its lower tree height compared with
PBOB, MSPT and the binary trie. Comparing MSPT and PBOB, the lookup time of MSPT is about
82% to 92% of that of PBOB (Table 5.4). This is because each node in MSPT represents a
prefix in the routing table and almost all enclosure sets in MSPT are empty (i.e., only a
small fraction of the prefixes in a routing table are stored in the enclosure sets of
MSPT). In PBOB, however, each node represents only a point, so all prefixes in a routing
table must be stored in the range sets of PBOB; as a result, less than 1% of the range
sets are empty. When the search procedure traverses from the root node to a leaf node, the
enclosure set of almost every node in MSPT is empty, so no extra time is spent checking
whether other prefixes also match the destination address, whereas checking the range set
of each node in PBOB is almost always necessary because the range set of almost every node
in PBOB is not empty. Thus, MSPT has better search performance than PBOB.

Next, consider the precomputation-based schemes and the 16-bit segmentation table. The
search performance of MSPT-16 is very close to that of the precomputation-based schemes.
This is because, after adopting the 16-bit segmentation table, the number of prefixes in
each segment is much smaller than the number of prefixes in the original prefix database.
The small number of prefixes in each segment reduces the differences among the IP address
lookup schemes. Table 5.5 gives the statistics of the 16-bit segmentation table for each
prefix database.

Table 5.5 The statistics for 16-bit segmentation table.

Database (AS number)                 AS6447   AS6447   AS6447   AS7660   AS2493
# of entries                         79560    124824   163574   159816   157154
# of nonempty segments               4784     6882     8698     9160     9070
Max size of nonempty segments        261      297      2078     243      243
Average size of nonempty segments    15.14    15.48    17.74    16.37    16.25

Update Time. To measure the average update (insert/delete) time, we start by selecting
2000 prefixes from the database. We first use the remaining prefixes of the database to
build the data structure. After the data structure is constructed, the selected 2000
prefixes are inserted into it. Once the 2000 insertions are done, the selected 2000
prefixes are removed again. The total elapsed time of inserting 2000 prefixes and removing
2000 prefixes is divided by 4000 to get the average time for a single update. This
experiment is repeated 10 times and the mean of the average update times is computed.
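The update experiment can be sketched in the same style. This is a hedged illustration: time.perf_counter_ns stands in for RDTSC, a new random sample is assumed for each repetition (the text does not specify this), and the function name mean_update_time_us together with its build, insert and delete callbacks are hypothetical placeholders for whichever scheme is under test.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")
T = TypeVar("T")


def mean_update_time_us(prefixes: Sequence[P],
                        build: Callable[[Sequence[P]], T],
                        insert: Callable[[T, P], None],
                        delete: Callable[[T, P], None],
                        sample: int = 2000,
                        repeats: int = 10,
                        seed: int = 0) -> float:
    """Mean time of a single update in microseconds."""
    rng = random.Random(seed)
    averages = []
    for _ in range(repeats):
        chosen = set(rng.sample(range(len(prefixes)), sample))
        held_out = [p for i, p in enumerate(prefixes) if i in chosen]
        table = build([p for i, p in enumerate(prefixes) if i not in chosen])
        t0 = time.perf_counter_ns()
        for p in held_out:                 # the selected prefixes are inserted
            insert(table, p)
        for p in held_out:                 # and then removed again
            delete(table, p)
        elapsed = time.perf_counter_ns() - t0
        averages.append(elapsed / (2 * sample))   # 2 * sample updates in total
    return sum(averages) / len(averages) / 1000.0


# Example with a Python set standing in for the routing table structure.
sizes = []


def build_set(ps):
    sizes.append(len(ps))
    return set(ps)


t_us = mean_update_time_us(list(range(50)), build_set,
                           lambda s, p: s.add(p), lambda s, p: s.remove(p),
                           sample=10, repeats=3)
```

With sample=2000 the divisor becomes 4000, matching the averaging described in the text.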

However, this measurement is not suitable for the precomputation-based schemes, because
inserting a prefix into a precomputation-based scheme may affect the entire data
structure. To measure their update time (rebuild time) for BRS-16, BPS-16 and HCA-16, we
add the time of maintaining the 16-bit segmentation table to the time of rebuilding a
segment. For BLS, the time of creating all markers when inserting a prefix and the time of
finding the LPM of all markers are counted. Finally, for Lulea, we measure the time of
rebuilding the entire data structure.

Table 5.6 gives the computed mean times and Figure 5.4 plots them as a histogram
(excluding binary length search and the Lulea compressed trie). Clearly, the update
performance of the non-precomputation-based schemes is much better than that of the
precomputation-based schemes. Without the 16-bit segmentation table, PBOB and MSPT have
almost the same update performance, which is the best among these schemes. With the 16-bit
segmentation table, MSPT-16 has the best update performance among all tested schemes.

59

Table 5.6 The statistics of update time (in microseconds) for IPv4.

Databases: AS6447 2002-4, AS6447 2005-4, AS7660 2005-4, AS2493 2005-4

Non precomputation-based (dynamic routing table structures)
PBOB       1.15    1.17    1.25    1.21    1.23
MSPT       1.15    1.15    1.24    1.21    1.17
BTRIE      1.19    1.31    1.39    1.35    1.32
MRT        1.67    1.78    1.80    1.84    1.86
PBOB-16    0.97    0.99    1.01    1.02    0.99
MSPT-16    0.76    0.79    0.82    0.82    0.78
BTRIE-16   0.77    0.81    0.87    0.83    0.88

Precomputation-based (static routing table structures)
BRS-16     9.06    9.59    9.90    9.16    9.14
BPS-16     5.75    6.41    6.73    6.12    6.27
HCA-16     14.93   17.50   28.95   14.87   14.46
BLS        132.78  210.01  266.77  259.16  253.76
Lulea      1454.9  1633.9  1891.3  1864.5  1885.4

Figure 5.4 Update time (in microseconds) for IPv4.


5.3 Simulation Result for IPv6

In order to measure the performance of IP address lookup algorithms for IPv6, IPv6
routing tables are needed. Since there are few IPv6 users at present, current IPv6 tables
are small and unlikely to reflect future IPv6 network growth. In our experiment, we use
four IPv6 prefix databases. The databases V6table-1 and V6table-2, obtained from [1], are
real routing tables from IPv6 backbone routers. The databases GV6table-1 and GV6table-2
are generated from two real IPv4 routing tables. The IPv6 table generation scheme is
described in [20].

Although many proposed IPv4 schemes adopt a 16-bit segmentation table to speed up
lookups, we think the 16-bit segmentation table is no longer suitable for IPv6, where the
address length becomes 128 bits. In this section, we therefore experiment only with the
non-precomputation-based schemes MSPT, PBOB and BTRIE. The implementation of PBOB for
IPv6 is the same as that of MSPT described in Section 4.6.

Total Memory Requirement. Table 5.7 shows the memory requirement of each tested
scheme, and the detailed data structure analysis is shown in Table 5.8. The experimental
results for IPv6 are consistent with those for IPv4: MSPT has the best memory performance
among all tested schemes. Figure 5.5 is the histogram of the memory requirement.

Table 5.7 The statistics of memory requirement (in KB) for IPv6.

Database       V6table-1   V6table-2   GV6table-1   GV6table-2
# of entries   274         593         9788         20070
MSPT           9.9         18.3        312          605
PBOB           12.5        24.1        386.5        763.1
BTRIE          22.6        29.2        822.2        1786.5


Table 5.8 Data structure analysis for IPv6 prefix databases.

Scheme                             V6table-1   V6table-2   GV6table-1   GV6table-2
# of prefixes                      274         593         9788         20070
MSPT   # of nodes                  361         675         11491        22412
MSPT   # of empty enclosure sets   325         632         10723        21219
PBOB   # of nodes                  382         706         11866        23024
PBOB   # of empty range sets       115         123         3604         5870
BTRIE  # of nodes                  2313        2992        84195        182941

Figure 5.5 Total memory requirement (in KB) for IPv6.


Search and Update Time. The search and update performance are shown in Table 5.9,
and Figure 5.6 plots the mean search and update times. For the small tables V6table-1 and
V6table-2, MSPT performs similarly to PBOB. For the large tables GV6table-1 and
GV6table-2, MSPT is slightly better than PBOB. BTRIE has the worst performance among all
tested schemes in every case. This is because the cost of a BTRIE operation grows linearly
with the prefix length, so it does not scale well to longer IP addresses (as introduced in
Chapter 3, trie-based schemes usually do not scale well to longer IP addresses).
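The linear dependence on prefix length can be seen in a minimal binary trie sketch. The code below is an illustration only and is not the BTRIE implementation measured in our experiments: the names `TrieNode`, `insert`, and `lookup` are hypothetical, and prefixes and addresses are written as bit strings for readability.

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]  # child[0] for bit 0, child[1] for bit 1
        self.next_hop = None       # set only if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.child[i] is None:
            node.child[i] = TrieNode()
        node = node.child[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Return (longest-match next hop, number of trie nodes visited below the root)."""
    node, best, steps = root, root.next_hop, 0
    for b in addr_bits:
        node = node.child[int(b)]
        if node is None:
            break
        steps += 1
        if node.next_hop is not None:
            best = node.next_hop
    return best, steps

root = TrieNode()
insert(root, "1" * 24, "A")     # a 24-bit prefix, as is common in IPv4 tables
insert(root, "0" * 64, "B")     # a 64-bit prefix, as can occur in IPv6 tables
print(lookup(root, "1" * 32))   # ('A', 24)
print(lookup(root, "0" * 128))  # ('B', 64)
```

Each matched bit costs one node visit, so a lookup that matches a 64-bit IPv6 prefix visits 64 nodes, whereas a balanced search tree visits a number of nodes that depends on the table size rather than on the address length.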

Table 5.9 The statistics of search and update time (in microseconds) for IPv6.

Operation  Scheme   V6table-1   V6table-2   GV6table-1   GV6table-2
# of prefixes       274         593         9788         20070
Search     MSPT     0.249       0.318       0.415        0.532
Search     PBOB     0.255       0.356       0.516        0.752
Search     BTRIE    0.392       0.48        0.79         0.943
Update     MSPT     0.562       0.631       0.678        0.734
Update     PBOB     0.592       0.655       0.721        0.792
Update     BTRIE    1.539       1.183       1.723        1.79


Figure 5.6 The mean times of search and update for IPv6.


Chapter 6 Conclusion

We have developed a Most Specific Prefix Tree (MSPT) data structure that is suitable
for representing dynamic routing tables. MSPT is a balanced binary search tree constructed
from all the most specific prefixes in a routing table. The remaining (non-most specific)
prefixes are allocated to the enclosure sets of the most specific prefix nodes in MSPT.
Based on MSPT, the search, insertion, and deletion operations can be finished in O(log N)
time for a real routing table, where N is the number of prefixes.
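As a reminder of the partition on which MSPT is built, the following sketch splits a set of prefixes into most specific prefixes and the rest. It assumes that a most specific prefix is one that is not a proper prefix of any other prefix in the table. The function name `split_prefixes` and the example table are hypothetical, and neither the balanced tree nor the enclosure sets are modeled here.

```python
def split_prefixes(prefixes):
    """Split bit-string prefixes into (most specific prefixes, enclosing prefixes)."""
    unique = sorted(set(prefixes))
    most_specific, others = [], []
    for i, p in enumerate(unique):
        # In lexicographic order, every extension of p sorts directly after p,
        # so checking the immediate successor is sufficient.
        nxt = unique[i + 1] if i + 1 < len(unique) else None
        if nxt is not None and nxt.startswith(p):
            others.append(p)
        else:
            most_specific.append(p)
    return most_specific, others

table = ["0", "00", "01", "1", "101", "11"]  # made-up example prefixes
most_specific, others = split_prefixes(table)
print(most_specific)  # ['00', '01', '101', '11']
print(others)         # ['0', '1']
```

Under this assumption no most specific prefix encloses another, so the address ranges they cover are disjoint and can be totally ordered, which is what allows them to serve as keys of a binary search tree.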

Compared with other schemes that are suitable for dynamic routing tables and with
several precomputation-based schemes, our experiments show that MSPT is to be preferred
over PBOB, MRT and BTRIE for representing dynamic routing tables, and that its lookup
speed and memory requirement are close to those of the precomputation-based schemes.
Moreover, our scheme scales well to IPv6 and to large routing tables.

A balanced binary search tree may need more memory accesses to complete search and
update operations when the number of prefixes in the routing table becomes large. In the
future, we will try to use a balanced m-way search tree to construct MSPT. We expect the
lower tree height of a balanced m-way search tree to reduce the number of memory accesses
and thus give better performance.
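The expected reduction in tree height can be estimated with a short calculation. The sketch below computes the smallest height at which a completely full m-way search tree can hold a given number of keys. This is a lower bound and not a measurement: real balanced trees are not completely full, the function name `min_height` is hypothetical, and the prefix count of GV6table-2 is used only as an example key count.

```python
def min_height(num_keys: int, m: int) -> int:
    """Smallest height h such that a full m-way search tree of height h holds num_keys keys.

    Each node holds at most m - 1 keys, so a full tree of height h holds m**h - 1 keys.
    """
    height, capacity = 0, 0
    while capacity < num_keys:
        height += 1
        capacity = m ** height - 1
    return height

n = 20070  # number of prefixes in GV6table-2, used as an example key count
for m in (2, 4, 8, 16):
    print(m, min_height(n, m))
# prints:
# 2 15
# 4 8
# 8 5
# 16 4
```

If each node visit corresponds to one memory access, moving from a binary tree to an 8-way tree would reduce the accesses on a root-to-leaf path from about 15 to about 5 for a table of this size.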


References

[1] BGP Routing Table Analysis Reports, http://bgp.potaroo.net/.

[2] A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small Forwarding Tables for Fast Routing Lookups," ACM SIGCOMM, pp. 3-14, Sept. 1997.

[3] Y. K. Chang, "Fast Binary and Multiway Prefix Searches for Packet Forwarding," submitted for publication.

[4] S. Deering and R. Hinden, RFC 2460 Internet Protocol, Version 6 (IPv6) Specification.

[5] A. Durand and C. Huitema, RFC 3194 The H-Density Ratio for Address Assignment Efficiency An Update on the H ratio.

[6] V. Fuller, T. Li, J. Yu and K. Varadhan, "Classless inter-domain routing (CIDR): an address assignment and aggregation strategy," RFC 1519, Sept. 1993.

[7] E. Horowitz, S. Sahni, and D. Mehta, Fundamentals of Data Structures in C++. New York: W.H. Freeman, 1995.

[8] N. F. Huang, S. M. Zhao, J. Y. Pan, and C. A. Su, "A Fast IP Routing Lookup Scheme for Gigabit Switching Routers," in Proc. INFOCOM, pp. 1429-1436, Mar. 1999.

[9] IPv6 Forum, http://www.ipv6forum.com.

[10] K. Kim, S. Sahni, "An O(log n) Dynamic Router-Table Design," IEEE Transactions on Computers, pp. 351-363, Mar. 2004.

[11] B. Lampson, V. Srinivasan and G. Varghese, "IP Lookups Using Multiway and Multicolumn Search," IEEE/ACM Transactions on Networking, Vol. 7, No. 3, pp. 324-334, Jun. 1999.

[12] H. Lu, S. Sahni, "Enhanced Interval Tree for Dynamic IP Router-Tables," IEEE Transactions on Computers, pp. 1615-1628, Dec. 2004.

[13] X. Meng, Z. Xu, B. Zhang, G. Huston, S. Lu, L. Zhang, "IPv4 Address Allocation and the BGP Routing Table Evolution," ACM SIGCOMM, pp. 71-80, Jan. 2005.

[14] D. Meyer, "University of Oregon Route Views Archive Project," http://archive.routeviews.org/.

[15] S. Nilsson and G. Karlsson, "IP-Address Lookup Using LC-trie," IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999.

[16] K. Sklower, "A Tree-based Packet Routing Table for Berkeley Unix," Proc. Winter Usenix Conf., pp. 93-99, 1991.

[17] M. A. Ruiz-Sanchez, E. W. Biersack, and W. Dabbous, "Survey and Taxonomy of IP Address Lookup Algorithms," IEEE Network Magazine, 15(2):8-23, March/April 2001.

[18] M. Waldvogel, G. Varghese, J. Turner and B. Plattner, "Scalable High-Speed IP Routing Lookups," ACM SIGCOMM, pp. 25-36, Sept. 1997.

[19] P. C. Wang, C. T. Chan and Y. C. Chen, "An Efficient IP Routing Lookup by Using Routing Interval," Journal of Communication and Networks, pp. 374-382, Mar. 2001.

[20] M. Wang, S. Deering, T. Hain, L. Dunn, "Non-random Generator for IPv6 Tables," in Proc. 12th Annual IEEE Symposium of High Performance Interconnects, pp. 35-40, Aug. 2004.

[21] P. Warkhede, S. Suri, G. Varghese, "Multiway Range Trees: Scalable IP Lookup with Fast Updates," The International Journal of Computer and Telecommunications Networking, pp. 289-303, Feb. 2004.

