
Building Nemo: Extended discussion

Frederic Raspall

Department of Network Engineering, Universitat Politecnica de Catalunya (UPC)

Abstract

This document (reference [26]) contains additional information regarding the design of Nemo and the configuration of routers. Specifically, it describes: 1) how routers can be forced to announce their direct routes to neighbors (i.e. make them visible), as needed to ensure that changes in interface costs are detected and that the deduced interface costs are up-to-date; 2) our interim solution to detect multipath BGP routes in use; 3) how the SPT calculation should be modified to work with hierarchical IGP setups; and 4) our second implementation of the RS.

1. Forcing routers to advertise their direct routes toward adjacent neighbors

Forcing routers to announce their direct routes (if any) toward adjacent neighbors when several routes exist is easy to do, but the required configuration can look counter-intuitive on some devices and deserves an explanation. Suppose a router X has 3 neighbors A, B and T (with link addresses a, b and t). Fig. 1 shows a possible configuration (in Cisco-IOS-like syntax, using a route-map construct) to force X to announce its direct route toward the loopback (Lo) of T.

router bgp ASN
 redistribute ospf [..] route-map FORCE_DIRECTS

route-map FORCE_DIRECTS permit 10
 (1) match ip prefix LOT
 (2) match ip next-hop t
 (3) set ip next-hop t

Figure 1: Note that a real configuration may require referring to access-lists or prefix-lists in the match clauses of the route-map.

Clause (1) ensures that the route-map matches only routes toward $L_o^T$ (LOT in Fig. 1), the loopback of T. Clauses (2) and (3) together may seem nonsensical. The idea is that match (2) is positive whenever X has a route (toward $L_o^T$) with next-hop t (even if it has other next-hops), in which case the set statement forces the next-hop to be t. If X has no route with t as next-hop, (2) is not met, (3) is not applied and the router advertises any of its routes according to some criteria. A route-map entry like the one above is required for each router adjacent to X (except for stub ones, toward which only direct routes can exist). Many implementations support the route-map construct, and the same behaviour may be achieved with JunOS import/export clauses, deemed more flexible.

2. An (unsuccessful) approach to “report” multiple IGP routes per prefix in BGP

The above behaviour could be used to cause routers to “announce” all their IGP routes to a destination when redistributing the IGP into BGP. This way, all routes would be visible, eliminating the need to unhide them via SNMP.

Let $A = \{a_1, a_2, \ldots, a_n\}$ be the set of link addresses (potential next-hops) of the routers adjacent to some router X. When a route to some prefix Q is pulled from the FIB for redistribution, all of its next-hops (i.e. all the routes to Q) are exposed in some set $S_Q$, and we could use this to tag the corresponding BGP route with one community per next-hop. That is, if we let community $C(a_1)$ represent next-hop $a_1$ and $C(a_2)$ next-hop $a_2$, we could add a filter (route-map) that appends one such community on each positive match (i.e. if $a_k \in S_Q$ then add $C(a_k)$), exploiting the fact that the COMMUNITY attribute is extensible. By cascading such tests, one per $a_i \in A$, the corresponding BGP route could include as many communities as next-hops (i.e. routes) and, by mapping communities back to addresses, a monitor could infer the existence of routes: an update to Q tagged as $C(a_2)\ C(a_7)$ would imply 2 IGP routes to Q, with next-hops $a_2$ and $a_7$; a subsequent update tagged only as $C(a_2)$ would mean that the route via $a_7$ ceased to be used. Further, as each community is 32 bits wide, we could even let $C(a_i) = a_i$.
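The tagging idea, including the $C(a_i) = a_i$ shortcut, can be illustrated with a short Python sketch. It only emulates the logic on plain data: the function names and addresses are hypothetical, and it is not router configuration nor Nemo's code.

```python
import ipaddress

def community_for(next_hop: str) -> int:
    """C(a_i) = a_i: use the 32-bit IPv4 address itself as the community value."""
    return int(ipaddress.IPv4Address(next_hop))

def tag_route(next_hops_in_use, adjacent_addresses):
    """Emulate the cascaded tests: add C(a_k) for every a_k in A that is in S_Q."""
    return [community_for(a) for a in adjacent_addresses if a in next_hops_in_use]

def next_hops_from(communities):
    """Monitor side: map the communities of an update back to next-hop addresses."""
    return [str(ipaddress.IPv4Address(c)) for c in communities]

# Hypothetical example: X has three adjacent link addresses, two of them in use for Q.
A = ["10.0.0.2", "10.0.0.7", "10.0.0.9"]
S_Q = {"10.0.0.2", "10.0.0.7"}
tags = tag_route(S_Q, A)
recovered = next_hops_from(tags)
```

Here `recovered` lists the two next-hops in use, so a monitor would infer 2 IGP routes to Q.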

A very simple way to implement this tagging on routers would be with the continue clause in route-maps [1]. In normal operation, a route-map stops processing at the first matching entry, and falls through to the next entry otherwise. The continue clause can be used to continue execution, if a match occurs, at a subsequent entry (the next one if none is specified). Thus, a configuration (using $|A|$ entries) like the one sketched in Fig. 2 could work.

route-map INCLUDE_NEXT_HOPS permit 10      \
 match ip next-hop a1                      | one entry per
 set community C(a1) additive              | address ai
 continue                                  /

route-map INCLUDE_NEXT_HOPS permit 20
 match ip next-hop a2
 set community C(a2) additive
 continue

Figure 2: Adding a BGP community per next-hop (route) with a route-map and the continue clause. If a match in the first entry occurs, community C(a1) is added and the second entry is evaluated (due to the continue). If the first fails, the second entry is still evaluated (normal route-map behavior).

Preprint submitted to Computer Networks May 15, 2015


Unfortunately, the continue clause may not be supported by all devices. Moreover, in the latest releases of Cisco IOS, it can only be used in route-maps for outbound announcements, which excludes redistribution.

3. Identifying and verifying BGP multipath routes

Routers can use multipath non-best BGP routes. By enabling add_paths, one can discover all the BGP routes a node has, but not whether they are used. BGP multipath often comes in two flavors, iBGP and eBGP. We focus on the former, as we are mostly concerned with routing within an AS. In either case, not every non-best route is eligible as multipath. In the iBGP case, its attributes must match those of the best route (a tie in the decision process, DP), except the NEXT_HOP, which must be at equal cost. Thus, non-best routes can enter the FIB depending on: 1) their attributes, 2) the cost to the NEXT_HOP and 3) limits on the number of paths to install.

Our interim workaround, until routers signal such routes with add_paths, is as follows. When receiving a BGP route to prefix P/l from router X, Nemo checks all of X's BGP routes toward P/l to compare PathIDs. If the new route is a best one, it flags all the non-best routes whose attributes match those of the best as twin (T), and as inactive otherwise. If the new route is not X's best, its attributes are compared against those of the best (if any). With this, Nemo does not determine whether non-best routes are in X's FIB (i.e. active), but disqualifies some that surely are not.
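The flagging step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `BgpRoute` record, its field names and the choice of compared attributes are hypothetical, and routes are assumed to be already grouped per router and prefix.

```python
from dataclasses import dataclass

@dataclass
class BgpRoute:
    path_id: int
    attrs: tuple        # attributes that must tie with the best (NEXT_HOP excluded)
    next_hop: str
    best: bool = False
    flag: str = ""      # "T" (twin) or "inactive"; left empty for the best route

def flag_non_best(routes):
    """Flag the non-best routes of one router toward one prefix."""
    best = next((r for r in routes if r.best), None)
    if best is None:
        return routes   # no best route known yet: nothing to compare against
    for r in routes:
        if not r.best:
            r.flag = "T" if r.attrs == best.attrs else "inactive"
    return routes

# Hypothetical routes of X toward P/l: attrs = (LocPref, AS_PATH, MED).
routes = flag_non_best([
    BgpRoute(1, (100, "65001 65002", 0), "10.0.0.1", best=True),
    BgpRoute(2, (100, "65001 65002", 0), "10.0.0.2"),
    BgpRoute(3, (100, "65003", 0), "10.0.0.3"),
])
flags = [r.flag for r in routes]
```

In this example the second route ties with the best and becomes a twin, while the third is disqualified as inactive.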

For twin routes to be used, their NEXT_HOP must be at the same cost as that of the best route, a condition that can change if IGP routes do. To avoid examining all twin routes on IGP changes, nhops store the cost to reach them, and best routes are put at the head of route-lists. When twin routes are needed (on a query or when storing tables), their nhops are compared against that of the best (seen first) and, if their costs match, they are flagged as multipath candidates (mC). Note that, as routers report an up-to-date (visible) IGP route per destination, their current IGP distance to every prefix is known, and so is that toward nhops if these are resolved whenever IGP changes occur. This does not suffice to assert whether mC routes are used, but it does suffice to detect when they could make it to the FIB or surely leave it (and no longer be mC).
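The lazy selection of multipath candidates can be sketched as below. The data layout is an assumption made for illustration: a route-list is a Python list with the best route first, each entry being a (next-hop, flag) pair, and `nhop_cost` stands for the cost stored in each nhop.

```python
def multipath_candidates(route_list, nhop_cost):
    """Return the next-hops of twin routes whose IGP cost equals that of the best.

    route_list: [(next_hop, flag), ...] with the best route at the head.
    nhop_cost:  {next_hop: IGP cost currently stored for that nhop}.
    """
    best_nhop, _ = route_list[0]
    best_cost = nhop_cost[best_nhop]
    return [nh for nh, flag in route_list[1:]
            if flag == "T" and nhop_cost[nh] == best_cost]

# Hypothetical example: two twins, only one at the same cost as the best.
route_list = [("10.0.0.1", ""), ("10.0.0.2", "T"), ("10.0.0.3", "T"), ("10.0.0.4", "inactive")]
nhop_cost = {"10.0.0.1": 20, "10.0.0.2": 20, "10.0.0.3": 30, "10.0.0.4": 20}
mc = multipath_candidates(route_list, nhop_cost)
```

Because the costs live in the nhops, an IGP change only updates `nhop_cost`; the twin routes themselves are examined only when the candidates are requested.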

Although only a small fraction of routes should be mC, checking whether all of them are active via SNMP may be costly. Thus, we verify whether an mC route is active only if a user so requests or if it is needed (e.g. a record to its prefix is received). The overhead of this approach can be further reduced as follows. Most implementations require the AS_PATH of routes to match that of the best (not just its length). Hence, multipath occurs when two ASes peer at several points that announce the same paths. In that case, chances are that the fate of many non-best routes coincides: either all are used or none. Thus, we can verify a subset of the routes and assume the same result for the rest. Alternatively, to derive traffic paths, we can omit verifying an mC route to P/l if its nhop maps to adjacencies already used to reach P/l by the best route.

4. SPF verification of SNMP routes in hierarchical IGP deployments

To assess the correctness of SNMP routes toward local routers (their Lo), Nemo computes the SPT of each router from the routers' adjacency tables and the cost of interfaces. If the direct routes toward adjacent routers are visible, the cost of transit interfaces will be up-to-date and the SPT that Nemo computes for each router X will match $SPT_X$, the one computed by the router itself.

However, in hierarchical setups (e.g. when several OSPF areas or IS-IS levels exist), routers do not know of the existence of routers and links outside the areas they belong to. Since, as described, Nemo does not know where area or level boundaries are, it could compute wrong SPTs¹, as it would erroneously include links and nodes that some router X does not know of and thus never includes in its SPT.

To see the problem, consider the setup in Fig. 3, with 3 OSPF areas ($a_0$, $a_1$, $a_2$) and router $X \in a_0$. X's LSDB would only contain links and routers in area $a_0$, and so would $SPT_X$, the SPT computed by X itself. If the SPT that Nemo computes for X included other nodes and links, it could still correctly validate routes when non-backbone areas are connected through a single ABR (e.g. B), since Nemo's SPT for X would then match the concatenation of $SPT_X$ and $SPT_B$. However, when areas are connected through multiple ABRs (e.g. E and F in $a_2$), Nemo's SPT for X may differ from $SPT_X$, as the additional links erroneously considered could lead to the addition of lower-cost paths (branches in the SPT) that routers would not use. For instance, in Fig. 3, the SPT computed by Nemo for X could include the F-E link (in area $a_2$), while $SPT_X$ would never do so and would instead use the Y-E or B-E links inside area $a_0$.

Figure 3: The SPT that Nemo computes for X should not span beyond area $a_0$, since the $SPT_X$ computed by X itself would not.

¹ IS-IS and OSPF differ in this regard. Here, we focus on OSPF, where area boundaries occur within Area-Border Routers (ABRs).

To solve the issue, we are extending Nemo as follows. To prevent it from wrongly including nodes in SPTs, we let it know the set of areas $A_J$ each router J belongs to and, from this information, let adjacencies be tagged with the area they pertain to. The SPF algorithm is modified (for some area $a_k$) to disregard all adjacencies outside the area or toward any node v outside the area (any v such that $a_k \notin A_v$). This simple change excludes all routers outside an area, but not all the links it should disregard (e.g. the F-E link). The problem could be solved if the tool knew exactly the area every interface belongs to, which could be done via the setup file. But, unlike for the $A_J$ sets, this would require extensive configuration and keeping it consistent with that of the routers. Luckily, as interfaces (and links) can belong to only one OSPF area, the area to which most links pertain can be unequivocally deduced as follows (we use this to tag adjacencies).

From the $A_J$ sets, we can readily tell the area to which stub interfaces connected to non-ABR routers ($|A_J| = 1$) pertain (e.g. $S_R, S_Z, S_Y \in a_0$ and $S_N \in a_1$). Similarly, we can know the area to which a transit link pertains if either of the two neighbors, α and β, resides in a single area (i.e. $|A_\alpha| = 1$ or $|A_\beta| = 1$), as well as if they have only one area in common ($|A_\alpha \cap A_\beta| = 1$). Thus, there are only two cases where it may not be possible to determine to which area an interface (or adjacency) belongs:

I) a transit interface such that $|A_\alpha \cap A_\beta| > 1$, which requires $|A_\alpha| > 1$ and $|A_\beta| > 1$ (i.e. α and β to be ABRs). This is the case for the F-E link, since $A_F \cap A_E = \{a_0, a_2\}$.

II) ABRs' stub interfaces (e.g. $S_B$, as $|A_B| > 1$).
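The deduction rules and the two undecidable cases above can be condensed into a short Python sketch. Function names are hypothetical, and the area sets in the example are illustrative values rather than a transcription of Fig. 3.

```python
def transit_link_area(areas_alpha, areas_beta):
    """Area of a transit link between neighbors alpha and beta, or None in case I."""
    if len(areas_alpha) == 1:           # alpha resides in a single area
        return next(iter(areas_alpha))
    if len(areas_beta) == 1:            # beta resides in a single area
        return next(iter(areas_beta))
    common = areas_alpha & areas_beta
    if len(common) == 1:                # two ABRs sharing exactly one area
        return next(iter(common))
    return None                         # case I: must be specified manually

def stub_interface_area(areas_j):
    """Area of a stub interface of router J, or None in case II (J is an ABR)."""
    return next(iter(areas_j)) if len(areas_j) == 1 else None

# Illustrative A_J sets (assumed values).
A_X, A_B = {"a0"}, {"a0", "a1"}
A_E, A_F = {"a0", "a2"}, {"a0", "a2"}
xb = transit_link_area(A_X, A_B)
fe = transit_link_area(A_F, A_E)
```

With these sets, the X-B link is deduced to be in a0, whereas the F-E link yields `None` and falls under case I.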

Stub interfaces (case II) cannot be in the SPT of any router (other than as leaves). Thus, not knowing the area of ABRs' stubs cannot cause errors in SPTs. Regarding the interfaces in case I, note that we could determine the area they pertain to from the routes toward them reported by routers, in case ABRs summarize prefixes: e.g. routers in area $a_2$ would report routes toward the F-E link, but those in areas $a_0$ and $a_1$ would not. However, since such links must connect two ABRs, their number should be small, and Nemo requests the user to manually specify the area they pertain to.
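The modified SPF of this section can be sketched as a Dijkstra run that skips nodes and adjacencies outside the area of interest. This is a minimal sketch under assumptions: the data layout and the small topology are hypothetical, and the function returns distances rather than the tree itself.

```python
import heapq

def spf_in_area(source, area, adjacencies, router_areas):
    """Shortest-path distances from source, restricted to one area.

    adjacencies:  {u: [(v, cost, link_area), ...]}, link_area tagged as described.
    router_areas: {router: set of areas it belongs to} (the A_J sets).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                              # stale heap entry
        for v, cost, link_area in adjacencies.get(u, []):
            if area not in router_areas[v]:       # node outside the area
                continue
            if link_area != area:                 # adjacency outside the area
                continue
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

# Hypothetical topology: the F-E link is cheap but belongs to area a2.
adjacencies = {
    "X": [("Y", 1, "a0"), ("F", 1, "a0")],
    "Y": [("X", 1, "a0"), ("E", 5, "a0")],
    "F": [("X", 1, "a0"), ("E", 1, "a2")],
    "E": [("Y", 5, "a0"), ("F", 1, "a2")],
}
router_areas = {"X": {"a0"}, "Y": {"a0"}, "E": {"a0", "a2"}, "F": {"a0", "a2"}}
dist = spf_in_area("X", "a0", adjacencies, router_areas)
```

In this example E is reached through Y at cost 6; an unrestricted SPF would wrongly pick the path via the F-E link at cost 2.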

5. Scaling the RS

The shared trie may be able to store R&F state for many routers, since trie nodes (prefixes) are reused by the RIB/FIB of all routers. Such a design, which allowed us to assess the correctness of the tool, can be improved for the RS to scale in very large networks. Next, we summarize the ideas behind a second design that we implemented for the RS to better scale with the number of routes (prefixes) and routers. We start by identifying the inefficiencies of the original design.

While unibit tries are theoretically very efficient, their direct implementation in software can make suboptimal use of memory. A trie node requires 3 pointers (one per child plus one toward a route vector). In a naive approach, on 64-bit platforms (where pointers are 64 bits, or 8 bytes, wide), this spends 24 bytes per node. Also, with N local routers, each route vector occupies $64 \cdot N$ bits, whether its cells (one per router) point to routes or not. In addition, the routes reported by routers may have many attributes in common other than the destination. Such redundancy can be exploited to further reduce the memory footprint. On the other hand, long-lasting processes are prone to memory fragmentation which, given the way memory is dynamically allocated, can cause the system to fail to provide the memory required (in spite of it being available), virtually increasing the memory requirements. Our second solution is inspired by previous work to store FIBs (tailored to suit our needs) and relies on a pool-based memory allocation strategy (which also reduces pointer sizes), given that pooling is deemed beneficial for reducing memory fragmentation. We start by describing the ideas behind the pooling methodology that we employed.

5.1. Pool-based memory allocation strategy

Fig. 4 sketches the organization of the pool structure (POOL) that we use to allocate the items (of distinct types) that make up the new RS. Since the items of a POOL are of the same type and size, our RS uses several POOLs. By design, though, a POOL is type-independent, and some of its features are optional.

A POOL points to a subpool vector that points to subpools. The first time an item is requested, a subpool is allocated. Each subpool can hold B items of a certain fixed size (both configurable). When a subpool is full, a new one is allocated and, if needed, the subpool vector is resized. Every item can be accessed given the (real) index within the subpool where it resides, rindex, and the index of the subpool itself, sindex. From both values, each position is assigned a virtual index, vindex, used to reference it throughout. This creates a virtual, linear pool that permits avoiding native 64-bit pointers: virtual indices are 32 bits wide, but their storage can be shrunk to fewer bits depending on the type of items they refer to. For instance, it is very unlikely that more than $2^{16}$ next-hop addresses exist. Thus, if kept in a POOL, their virtual indices can be stored with 16 bits. Variable next_free contains the vindex of the next non-occupied position available.

The vindex corresponding to a position or (sindex, rindex) pair is $B \cdot \mathit{sindex} + \mathit{rindex}$. E.g. the item at rindex 6 in the third subpool (sindex = 2) has vindex = $2B + 6$ (see Fig. 4). Translating some vindex V back to a (sindex, rindex) pair (e.g. to access an item in memory given its vindex) can be easily done from the size of each subpool, B. By construction, a vindex of V equals $S \cdot B + R$ for some S and $0 \le R \le B - 1$. Thus, $V/B = S + R/B$ with $R/B < 1$. Therefore, $S = \lfloor V/B \rfloor$ and $R = V - S \cdot B$. These operations can be very efficiently implemented (e.g. if B is a power of 2, the product and divisions can be done with bit-level shift operations). From S and R, a pointer to an item's address is trivially derived using the size of items. Thus, we can think of all items in a pool as being stored in a contiguous array indexed by virtual indices, which we can convert to pointers and vice-versa very efficiently.
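The index arithmetic above can be sketched in Python for the power-of-2 case, where the product, division and remainder reduce to shifts and a mask. The value B = 64 and the function names are assumptions chosen for illustration.

```python
B_SHIFT = 6              # assumed subpool size: B = 2**6 = 64 items
B = 1 << B_SHIFT

def to_vindex(sindex: int, rindex: int) -> int:
    """vindex = B * sindex + rindex, with the product done as a shift."""
    return (sindex << B_SHIFT) + rindex

def from_vindex(vindex: int):
    """S = floor(V / B) and R = V - S * B, done as a shift and a mask."""
    return vindex >> B_SHIFT, vindex & (B - 1)

v = to_vindex(2, 6)      # item at rindex 6 in the third subpool: 2B + 6
pair = from_vindex(v)    # back to (sindex, rindex)
```

A pointer to the item would then follow from the base address of subpool S plus R times the item size.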

Figure 4: POOL structure used to store (and reference) the items of the RS.


Some items may need to be linked in lists. We solve this by letting items have a next field that keeps the vindex of the next item (instead of a native pointer) and forbid vindex 0, as it signals a “null” in next fields. Similarly, items may need to be found quickly. For instance, routes may point to next-hop items, which need to be looked up to see whether they already exist (so as not to store them per route). POOLs optionally have a hash index able to store vindex values according to a key. Collisions are resolved by chaining, using vindex values to link items (e.g. items d, c, a).

Some items may be freed and created frequently. To reuse freed positions (e.g. F in Fig. 4), we could recall the vindex of the last freed item and offer it on the next allocation. But, if several items are freed between allocations, this requires keeping a list (or heap) of indices, or scanning subpools to find freed positions (e.g. if items have a bit telling whether they are free or occupied). Since a POOL can have many subpools and these can be large, to avoid such searches, each subpool can have a bitmap of B bits, each bit corresponding to a position. When the item with rindex = k in the j-th subpool is freed, the k-th bit in the bitmap of subpool j is set; so is the j-th bit in the global bitmap (indicating that subpool j has at least one freed position), and counter num_freed is incremented.

When an item is to be stored, if num_freed is 0, the position with index next_free is used. If num_freed > 0, the global bitmap is examined to find a subpool with freed items (whose bitmap has bits set). The bits set correspond to the real indices of freed positions. We select the first subpool with freed positions and the first freed position in it. When such a position is reused, num_freed is decremented, the corresponding bit is cleared and, if the subpool has no more freed positions (its bitmap is all zeros), the bit in the global bitmap for the subpool is cleared too. This is as if each item position had a bit indicating whether it is free or not. However, by keeping such bits together, we avoid scanning the pool and instead process a bitmap, at the cost of $1 + 1/B$ additional bits per item. In this connection, we always set $B \ge 64$ and internally represent bitmaps as arrays of 64-bit words. To identify bits, each has a virtual bit position (vbit) derived from the index of the word it belongs to and the (real) offset therein (rbit), in a manner similar to vindex for items (words playing the role of subpools). E.g. vbit 130 = 2 · 64 + 2 lies at offset rbit = 2 within the 3rd 64-bit word (word index 2). As Fig. 4 sketches, translating a virtual bit position to a (word, rbit) pair and vice-versa can be done very efficiently with shift operations; so can finding the first bit set in a 64-bit word (e.g. to find the first freed position). For instance, the GCC compiler has several highly optimized built-in functions to count the number of leading zeros in a word.
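The bookkeeping for freed positions can be sketched in Python as follows. This is only an illustration under assumptions: item storage is omitted, bitmaps are modelled as Python integers rather than arrays of 64-bit words, the "first" bit is taken to be the least significant one, and double frees are not detected.

```python
def first_bit_set(word: int) -> int:
    """Offset of the lowest bit set in a non-zero word."""
    return (word & -word).bit_length() - 1

class PoolIndex:
    """Tracks allocated and freed virtual indices of a POOL-like structure."""

    def __init__(self, B: int = 64):
        self.B = B
        self.next_free = 1        # vindex 0 is reserved as "null"
        self.num_freed = 0
        self.freed = {}           # sindex -> bitmap of freed positions in that subpool
        self.global_bitmap = 0    # bit j set: subpool j has at least one freed position

    def free(self, vindex: int) -> None:
        s, r = divmod(vindex, self.B)
        self.freed[s] = self.freed.get(s, 0) | (1 << r)
        self.global_bitmap |= 1 << s
        self.num_freed += 1

    def alloc(self) -> int:
        if self.num_freed == 0:               # nothing to reuse: take a fresh position
            v = self.next_free
            self.next_free += 1
            return v
        s = first_bit_set(self.global_bitmap)  # first subpool with freed positions
        r = first_bit_set(self.freed[s])       # first freed position in it
        self.freed[s] &= ~(1 << r)
        if self.freed[s] == 0:
            self.global_bitmap &= ~(1 << s)
        self.num_freed -= 1
        return s * self.B + r

pool = PoolIndex()
first_five = [pool.alloc() for _ in range(5)]   # fresh positions 1..5
pool.free(3)
pool.free(2)
reused = [pool.alloc(), pool.alloc(), pool.alloc()]
```

After freeing positions 3 and 2, the next allocations reuse 2 and then 3 before moving on to the fresh position 6.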

5.2. Our second implementation of the RS

One way to optimize unibit tries in speed and space is with multibit ones, where more than one prefix bit is processed (stored) at a time; specifically S bits (the stride length). But not every multibit trie or FIB optimization technique suits our purpose. For instance, for a certain value of S, a multibit node may have $2^S$ leaves and $2^S - 1$ inner nodes. By expanding prefixes to the stride length, inner nodes can be ignored and the node can be implemented as an array of $2^S$ cells, each containing two pointers (one for next-hop data and the other to point to a child node). While this can be done so as not to affect forwarding, it is not suitable for our purposes, as it obscures prefix lengths. In our second implementation, the encoding of the trie borrows ideas from the Lulea approach [2] and Tree Bitmap [3] for FIBs.

Conceptually, we split the original unibit trie into chunks that we call B-nodes, each containing an expanded portion of a certain depth S (the stride length). For some S, a B-node contains $2^S - 1$ internal nodes and $2^S$ leaf or external nodes (for a total of $2^{S+1} - 1$). Internal nodes can point to routes, while external ones (at depths that are multiples of S) point to other B-nodes, in lower levels, to keep the structure of the original trie. The left part of Fig. 5 depicts a tree of B-nodes starting at a root node and going down several levels. Note that the larger the value of S, the fewer the levels (for a constant stride length) but the larger the B-nodes, and that the number of B-nodes that can exist at each level varies: there can only be $2^S$ B-nodes at level L1, and up to $2^{iS}$ at level i. Also notice that some unibit nodes are represented twice, appearing both as external nodes of a B-node and as the top node of the child B-node pointed to. This allows for the following optimization, called leaf pushing [3], to reduce the size of B-nodes. As external nodes must point to other B-nodes, they cannot point to routes unless two pointers are used. To avoid using two pointers, if a route toward a prefix whose length is a multiple of S exists (i.e. whose trie node is external), the route is pushed down to the topmost node of a child B-node in the next level. Thus, each node in a B-node requires one pointer. E.g. in Fig. 5, route r4 toward prefix 111/3 would not be pointed to by the external node of the root B-node but by the topmost node of B-node C, in level L1.

With the above, a B-node could be implemented as an array of $2^{S+1} - 1$ pointers. But this would be wasteful, as many pointers would be unused. Instead, a B-node consists of a bitmap of $2^{S+1}$ bits (one per node, plus an extra one whose position is the first in the bitmap). A bit set to 1 indicates that the corresponding node points to a route (if internal) or to a child B-node (if external). Bitmaps are internally represented by one or more 64-bit words. Each bit is identified by a virtual position (vbit) or a (word, rbit) pair, as done for freed bitmaps. The correspondence of bits to unibit nodes can be established in multiple ways. B-node H in level L4 of Fig. 5 shows the one we employed, also used in [3]. However, the authors of [3] propose to use a table for the mapping in software implementations. Instead, we let vbit start at 1, skipping the extra bit. With this, the vbit for a stride of size $s \le S$ is $2^s + \alpha$, with α the decimal value of the stride. E.g. in B-node H, the vbit for node m (stride 101/3) is $2^3 + 5 = 13$.
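The table-free mapping between strides and bit positions can be written in a few lines of Python. The function names are hypothetical, and strides are represented as bit strings for readability.

```python
def vbit_of(stride: str) -> int:
    """vbit = 2**s + alpha for a stride given as a bit string (e.g. '101')."""
    s = len(stride)
    alpha = int(stride, 2) if s else 0   # decimal value of the stride
    return (1 << s) + alpha

def stride_of(vbit: int) -> str:
    """Inverse mapping: recover the stride bits from a vbit >= 1."""
    s = vbit.bit_length() - 1            # stride size
    alpha = vbit - (1 << s)
    return format(alpha, "b").zfill(s) if s else ""

m = vbit_of("101")    # node m of B-node H: 2**3 + 5
top = vbit_of("")     # topmost node of a B-node (empty stride)
```

Under this mapping, for S = 3, internal nodes occupy vbits 1 to 7 and external nodes vbits 8 to 15, i.e. the second half of the 16-bit bitmap.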

Bitmaps permit keeping fewer than $2^{S+1}$ pointers per B-node: only nodes whose bit is set require a pointer. Thus, shorter arrays can be used. But then we need to know which array position corresponds to each bit set in the bitmap (node and prefix). As suggested in [2][3], this can be derived from the bitmap: the array index for a bit equals the number of bits set that precede it (as shown for B-node H), which can be efficiently counted.
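Counting the bits set before a given position can be sketched as below. The bitmap is modelled as a Python integer in which position vbit is bit number vbit counted from the least significant end (an assumption about bit ordering), and the bitmap content is an illustrative value, not the actual one of B-node H.

```python
def linkage_index(bitmap: int, vbit: int) -> int:
    """Array index for position vbit: the number of bits set that precede it."""
    preceding = bitmap & ((1 << vbit) - 1)   # keep only the bits below vbit
    return bin(preceding).count("1")         # population count

# Illustrative bitmap with bits set at vbits 1, 4, 9 and 13.
bitmap = (1 << 1) | (1 << 4) | (1 << 9) | (1 << 13)
idx = linkage_index(bitmap, 13)
```

Here three bits precede position 13, so its cell is at index 3 of the array; in C, the count would typically map to a population-count instruction.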

The right diagram in Fig. 5 shows how we code B-nodes (as a bitmap and a linkage pointer) and store them in POOLs, one per level (as stride lengths and thus B-node sizes could vary per level). Pointer linkage points to an array of virtual indices of B-nodes or routes, the latter stored in routeset items. Which type of item (and POOL) an index refers to is known from the bitmap: bits in the second half correspond to external nodes and point to B-nodes, except in the last level, where all point to routesets.

Figure 5: Left: a tree of B-nodes representing a unibit trie. Right: Bitmap Tree implementation using POOLs and virtual indices. B-nodes are of fixed size (which eases their allocation) and consist of a bitmap (one bit per node) and a linkage pointer. The specific bit assignment used is shown for B-node H. The linkage pointer points to an array with as many positions as bits are set in the bitmap. Which cell corresponds to a bit set can be known by counting the number of bits set before it: for instance, the cell for bit (node) “m” of B-node H is at index 3 since there are 3 bits set before “m” in the bitmap.

Routesets (described later) are linked in lists when several distinct ones exist for a prefix (e.g. r2, r8 and r9 for prefix 1*). This allows using a single cell in the linkage array, no matter the number of routes toward the prefix, but requires traversing a list on lookups and insertions. As will be seen, such lists should be short given the way routesets are defined. Also note that linkage arrays need to be resized and their elements re-ordered (e.g. when a new prefix is inserted). In practice, this mostly occurs during the initial RIB transfers, as subsequent routing changes do not alter the order of the elements in the arrays or their size².

5.3. Storing R&F data in routesets

By design, we try to reuse routes' common fields among the routes of a single router and also among those from distinct ones, given that most routes will be BGP and differ little from router to router. Thus, a routeset keeps routing information for multiple (N) routers. Recall that we wish to store routes' properties (attributes, costs, type), how routers learnt them, how we discovered them, whether they are in use or not, etc. We use two types of routesets (one for BGP routes and the other for the rest: IGP, static, etc.), kept in distinct pools, as they differ in size. Which pool a routeset belongs to is derived from the 32-bit virtual indices (kept in linkage arrays or in next fields) as follows. If the first 16 bits are all set, the routeset is non-BGP; otherwise it is BGP. This allows referencing $2^{16} \approx 64\text{K}$ non-BGP routesets and $2^{32} - 64\text{K} \approx 4.29 \times 10^9$ BGP ones, limits unlikely to be exceeded in practice given the way routesets are defined.

² In [3], B-nodes that are children of a certain one are kept together, contiguously, and so are routes (next-hop data). Thus, their B-nodes need only 2 pointers, to the addresses of the first child and of the first route. These optimizations are possible in FIBs since they are derived from RIBs and the structure of the trie is known beforehand. We need to build and expand the trie from the routes' prefixes as routers announce them (in no particular order). Hence, storing B-nodes contiguously would be hard and require extensive swapping operations.
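The rule for telling BGP from non-BGP routesets by their virtual index can be sketched as a simple mask test. The sketch assumes that "the first 16" refers to the 16 most significant bits of the 32-bit index; the constant and function names are hypothetical.

```python
NON_BGP_MASK = 0xFFFF0000   # assumed: the 16 most significant bits of a 32-bit vindex

def routeset_pool(vindex: int) -> str:
    """Return which pool a 32-bit routeset vindex refers to."""
    return "non-BGP" if (vindex & NON_BGP_MASK) == NON_BGP_MASK else "BGP"

non_bgp_capacity = 1 << 16                   # indices 0xFFFF0000 .. 0xFFFFFFFF
bgp_capacity = (1 << 32) - non_bgp_capacity  # every other 32-bit value
```

The two capacities add up to the full 32-bit index space, matching the limits quoted above.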

5.3.1. Non-BGP routesets

The design of non-BGP routesets takes into account the following facts. Routers may have multiple IGP routes to the same internal destination. Also, IGP routes may have a subtype (e.g. inter-area) that may vary from router to router, even if toward the same prefix Q. Generally, for some prefix Q, some routers will report routes toward it (if in the same area), some may or may not (if in a distinct area, depending on whether prefixes are aggregated or not), and at least one may report a connected route to it. Thus, we structure non-BGP routesets as Fig. 6 depicts.

Figure 6: Structure of a non-BGP routeset and of a shared nhop.

The type and flags fields indicate the type (IGP or static) and protocol of a routeset (among others). An array of N cells (one per router) follows. Each cell stores the cost of the route


(if IGP) and its subtype. One bit indicates if the route is connected (even if the flags indicate that the routeset is IGP). This way, we avoid using another routeset for the routers directly connected to a destination, and a single one may be created per internal prefix in most cases (see footnote 3). To store multiple ECMP routes per prefix, cells point to arrays whose positions hold 2 bits indicating if the corresponding routes have been discovered via BGP (b) or SNMP (S), and the vindex of an nhop (shared or null). Shared nhops are kept in a hash-indexed POOL common to all routers, while null nhops are kept in one POOL per router. Shared nhops point to ifaddrs and contain an action field and resolution data for each of the N routers, which map next-hop addresses to one or more adjacencies in each router's adjacency table.

5.3.2. BGP routesets

BGP routesets strive to be as general as possible and, at the same time, be customizable depending on: 1) the network setup (e.g. whether RRs are used or not), 2) the deployment setup (e.g. whether add_paths is enabled on the monitoring sessions) and 3) the amount of information desired per route. The general strategy is to reuse data common to many routes, fitting it into fixed-sized units (to exploit POOLs), without unduly affecting implementation complexity or performance regarding insertion, update, traversal (when dumping tables), lookup, and path computation times.

To reuse data, BGP attributes are not kept in routesets themselves but in attrset items (referenced by them), stored in another hash-indexed POOL. When a route is learnt, the AS_PATH, COMMUNITY and NEXT_HOP attributes are sought (the former two in a conventional hash table; the latter, in the POOL of shared nhops) and get stored if not found. This process yields a pointer to an AS_PATH, one to a COMMUNITY set, and the vindex of an nhop. With these and the MED and LocPref attributes of the route (if present), an attrset is sought and, if none is found, a new one is stored. With this, routesets just include the vindex of an attrset (Va) and that of an nhop (Vn), as depicted in Fig. 7.
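The attribute-reuse scheme above amounts to interning: each distinct value is stored once and referenced by index. The sketch below illustrates the idea with plain Python dictionaries standing in for the hash tables and POOLs. The names InternPool and learn_route are hypothetical. It also assumes, in line with the breakdown given later in Section 6, that an attrset is keyed on the AS_PATH, COMMUNITY set, MED and LocPref only, with the nhop referenced separately by the routeset.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Set, Tuple


class InternPool:
    """Stores each distinct item once and hands out a stable index for it."""

    def __init__(self) -> None:
        self._index: Dict[Hashable, int] = {}
        self.items: List[Hashable] = []

    def intern(self, item: Hashable) -> int:
        """Return the index of item, storing it first if it is new."""
        idx = self._index.get(item)
        if idx is None:
            idx = len(self.items)
            self.items.append(item)
            self._index[item] = idx
        return idx


# Stand-ins for the hash tables and POOLs described in the text.
as_paths, communities, nhops, attrsets = (InternPool() for _ in range(4))


def learn_route(as_path: Sequence[int], comms: Set[str], next_hop: str,
                med: Optional[int] = None,
                locpref: Optional[int] = None) -> Tuple[int, int]:
    """Return (Va, Vn) for a route, reusing stored attributes when possible."""
    path_ref = as_paths.intern(tuple(as_path))
    comm_ref = communities.intern(frozenset(comms))
    vn = nhops.intern(next_hop)
    # Assumption: the attrset key excludes the nhop (kept separately as Vn).
    va = attrsets.intern((path_ref, comm_ref, med, locpref))
    return va, vn
```

Two routes that differ only in their next-hop would thus share one attrset (same Va) while referencing distinct nhops (different Vn).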

Figure 7: Structure of a BGP routeset, referencing an attrset and an nhop. By keeping the next field at the beginning (as in non-BGP routesets), both types of routesets can coexist in lists.

Apart from a next field, routesets have flags and a Ride field. The latter contains the ID of the router that learnt the route in eBGP. If a route is first reported by a router that iBGP-learnt it, the field is derived from attribute ORIGINATOR_ID and a flag is set. Thus, when the router that eBGP-learns it reports it, we can check the consistency of the field. Field V_n^loc contains the vindex of an alternative nhop, whose address should belong to the router with ID Ride. This permits using the same routeset if routers use next-hop self, where the next-hop reported in iBGP routes would be that of an internal router. Otherwise, two routesets would be instantiated for every BGP route.

3 If prefix aggregation occurs, additional routesets may be created. However, their number shall typically be small. Also, as these refer to shorter prefixes, the routeset list for an internal prefix may typically still contain only one element.

In the simplest setup, routers report only their best route, which shall be in use. Hence, a single bit suffices to tell if a router has a BGP route and uses it. Routesets include a bitmap (InUse) of N bits (one per router), where a set bit indicates that the router announced a route whose attributes are those in the referenced attrset and nhop. Note that the type of route (eBGP or iBGP) is implicitly stored: a routeset contains an eBGP route for the router specified in Ride and iBGP routes for the other routers whose InUse bit is set. Moreover, no extra field is needed to know from whom an iBGP route was learnt (unless reflected).
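The way the InUse bitmap and the Ride field jointly encode the route type can be illustrated with a toy model. The class below is a sketch only: the class and method names are hypothetical, router IDs are assumed to be integers in 0..N-1, and a Python integer stands in for the N-bit bitmap.

```python
from typing import Optional


class BgpRouteset:
    """Toy model of a plain BGP routeset with an N-bit InUse bitmap."""

    def __init__(self, n_routers: int, ride: int, va: int, vn: int) -> None:
        self.n_routers = n_routers
        self.ride = ride      # ID of the router that learnt the route in eBGP
        self.va = va          # vindex of the attrset
        self.vn = vn          # vindex of the nhop
        self.in_use = 0       # bit j set: router j announced (and uses) the route

    def _check(self, router_id: int) -> None:
        if not 0 <= router_id < self.n_routers:
            raise IndexError("unknown router ID")

    def set_in_use(self, router_id: int, used: bool = True) -> None:
        """Set or clear the InUse bit of a router."""
        self._check(router_id)
        if used:
            self.in_use |= 1 << router_id
        else:
            self.in_use &= ~(1 << router_id)

    def route_type(self, router_id: int) -> Optional[str]:
        """Return 'eBGP', 'iBGP', or None if the router has no route here."""
        self._check(router_id)
        if not (self.in_use >> router_id) & 1:
            return None
        return "eBGP" if router_id == self.ride else "iBGP"
```

No per-router type field is stored: the type follows from comparing the router ID with Ride whenever the corresponding InUse bit is set.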

As defined, a routeset uses a single bit per router, which suffices when nodes advertise only their best routes. However, it does not convey how a router learns each route if RRs are used. On the other hand, if add_paths is enabled, routers may report multiple routes per destination. In that case, we need to store, per router, whether it has a route or not, whether it uses it, its selection result (e.g. best, 2nd best, etc.) and a pathID. All this information could be stored within a routeset in an array (using more than a bit per cell). However, this could significantly enlarge routesets for large N and adapt memory requirements poorly to usage needs, unless distinct types of routesets were defined (which would complicate the implementation). Instead, we keep routesets small, but allow them to be virtually extended. This allows tailoring the memory spent depending on usage needs and may help preserve performance (see footnote 4).

5.4. Extending BGP routesets

BGP routesets can be extended to: 1) handle the case where add_paths is enabled on the monitoring sessions (and several routes per prefix are reported) and 2) store routes' learning paths when RRs are used. These extensions are described next.

5.4.1. Add-paths-extended BGP routesets

With add_paths, each routeset is extended, conceptually, as if it included an additional array of N cells, AP, as Fig. 8 shows.

Figure 8: Add-paths-extended BGP routesets.

The AP array, indexed by router IDs, uses 8 bits per router: a DP bit and a 7-bit pathID. While pathIDs are 32 bits long, most

4 By keeping routesets small (with just the essential data to compute paths), many may fit in a memory page, which can reduce the delay caused by page faults (thrashing) if memory is scarce, or allow them to be locked in physical memory.


implementations (Cisco and Juniper) use integers starting from 0 (and, as we could observe, reuse such numbers without gaps). We exploit this to store pathIDs using only 7 bits. This allows for up to 127 paths per prefix, which may suffice in almost every setup. Recall that InUse bits tell if a router uses some route(set). Thus, if set, a bit implies that the router has a route. But, with add_paths, routers may not use routes they report. Hence, from a zero InUse bit, we cannot distinguish the case where a router does not have a route (in some routeset) from that where it has the route but does not use it. To solve this, we use a pathID of 127 to signal that a router has no route in a routeset. The DP bit is interpreted together with the InUse bit as shown in Table 1.

InUse  DPbit  Interpretation
1      1      Route chosen as best
1      0      Non-best route being used (e.g. if BGP multipath is used)
0      1      2nd best route, not in use (next best if the current fails)
0      0      3rd or above (i.e. all)

Table 1: Meaning of the DPbit. Note that from the DP and InUse bits, and the pathID, we can know if a router reported (has) a route or not, its pathID, whether the router uses it or not and whether it is its best, 2nd best or neither.
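The encoding of Table 1 can be sketched as below. The exact bit layout of an AP cell is not specified in the text, so placing the DP bit in the most significant position of the byte is an assumption, and the function names pack_ap and describe are hypothetical.

```python
NO_ROUTE = 127  # reserved pathID: the router has no route in this routeset


def pack_ap(dp: bool, path_id: int) -> int:
    """Pack a DP bit and a 7-bit pathID into one byte.

    Assumption: DP occupies the most significant bit of the cell.
    """
    if not 0 <= path_id <= 127:
        raise ValueError("pathID must fit in 7 bits")
    return (int(dp) << 7) | path_id


def describe(in_use: bool, cell: int) -> str:
    """Interpret an AP cell together with the router's InUse bit (Table 1)."""
    dp, path_id = (cell >> 7) & 1, cell & 0x7F
    if path_id == NO_ROUTE:
        return "no route"
    meaning = {
        (1, 1): "best route",
        (1, 0): "non-best route in use",
        (0, 1): "2nd best route, not in use",
        (0, 0): "3rd best or beyond, not in use",
    }
    return meaning[(int(in_use), dp)]
```

Since pathID 127 is reserved, the usable pathIDs are 0 to 126, which gives the 127 paths per prefix mentioned above.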

The AP arrays are stored separately from routesets, in a distinct POOL. To link AP arrays with routesets without enlarging the latter with extra pointers (or virtual indices), each AP array is stored at the same virtual position as the routeset it pertains to, in its POOL. Storing the AP arrays separately does not reduce the overall memory requirements, but it incurs no memory overhead (other than a POOL struct), and routesets are kept small. Moreover, no significant changes in the logic are needed. In this regard, note that, whether AP arrays are used or not, the encoding of the DP bit in Table 1 is such that the semantics of InUse bits are preserved; these are the only bits that need to be inspected to compute paths (the content of AP arrays is only relevant for updates or for visualizing routes).

5.4.2. Learning-path-extended BGP routesets

BGP routesets can be extended (whether add_paths is used or not) to recall how routes spread (their learning paths) when RRs are used and the monitoring sessions are iBGP+RR.

In RR deployments, clients typically peer with more than one RR for robustness. Also, large networks follow hierarchical arrangements where RR-clients act as RRs of other routers. Thus, a router acting as RR can reflect the routes it learns (and chooses) from non-client peers to clients, but can also reflect those it learns from clients to non-clients (e.g. other RRs).

Recall that in the iBGP+RR setup, learning paths can be derived from the ORIGINATOR_ID and CLUSTER_LIST. The former is set by the router first reflecting a route. The latter is updated when a router reflects a route that has previously been reflected and already has an ORIGINATOR_ID. Note that since, in the iBGP+RR setup, all routers act as RR (on monitoring sessions), they always prepend their ID to the CLUSTER_LIST of reflected routes.

To see this, Fig. 9 depicts a setup where the routes for two prefixes P1 and P2 are propagated. P1 is learnt by router A (not a client of any router) and P2 is learnt by c5 (a client of R3). Since A and c5 do not reflect the route (as they learn it in eBGP), they do not include any of these attributes when announcing the routes toward the monitor. In the case of P1, all of A's peers (B, R1, R2 and R3) would report the route with an ORIGINATOR_ID of A and a CLUSTER_LIST equal to themselves (as all reflect routes to a monitor); and those being RR in other sessions would pass clients the same route (e.g. R2 toward c1 and c2).

Figure 9: Example of route propagation in a hierarchical RR setup. R1, R2 and R3 act as RR and iBGP-peer among themselves and with routers A and B, which are neither clients nor RR (except in the monitoring session). Router c1 is an RR-client of R1 and R2 and also acts as RR toward c11 and c12. The boxes represent (some of) the updates that routers would send toward a monitor (with the ORIGINATOR_ID and CLUSTER_LIST attributes) for the routes toward prefixes P1 and P2, eBGP-learnt by routers A and c5 (a client) from external routers e1 and e2.

Thus, the sequence formed by the CLUSTER_LIST plus the ORIGINATOR_ID equals the learning path (in reverse order).

For instance, the learning path of the route to P1 announced by router c11 would be c11 ← c1 ← R2 ← A, and that for P2 as announced by c12 would be c12 ← c1 ← R1 ← R3 ← c5. In this regard, note that, as clients peer with more than one RR, several learning paths may be possible. For instance, in Fig. 9, c1 would receive the route from both R1 and R2, of which it would pick one to further reflect to clients c11 and c12 and report to a monitor.
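The derivation of a learning path from the two attributes can be sketched as follows. This is an illustration under the assumptions stated in the text for the iBGP+RR monitoring setup: every reporting router prepends its own ID to the CLUSTER_LIST, and list entries are router IDs. Router names are used as IDs for readability, and the function name learning_path is hypothetical.

```python
from typing import List, Optional, Sequence


def learning_path(reporter: str, cluster_list: Sequence[str],
                  originator_id: Optional[str]) -> List[str]:
    """Return the learning path from the eBGP learner down to the reporter.

    The CLUSTER_LIST followed by the ORIGINATOR_ID is the path in reverse
    order, so the result is that sequence reversed. A route carrying no
    ORIGINATOR_ID is taken to be eBGP-learnt by the reporter itself.
    """
    if originator_id is None:
        return [reporter]
    reverse_path = list(cluster_list) + [originator_id]
    # Consistency check: the reporter prepends its own ID last.
    if reverse_path[0] != reporter:
        raise ValueError("CLUSTER_LIST does not start with the reporter")
    return reverse_path[::-1]


# The two examples of Fig. 9, as reported by c11 (for P1) and c12 (for P2).
p1_path = learning_path("c11", ["c11", "c1", "R2"], "A")
p2_path = learning_path("c12", ["c12", "c1", "R1", "R3"], "c5")
```

Here p1_path lists A, R2, c1, c11 and p2_path lists c5, R3, R1, c1, c12, that is, the paths given above read from right to left.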

To store learning paths, we use an array, LP, of N cells (one per local router). The jth cell of LP (corresponding to the router with ID j) stores the ID of the router immediately before it in the learning path (which propagated the route to j). Thus, the LP array for prefix P2 in Fig. 9 could look as shown in Fig. 10.

Figure 10: LP array for the routeset corresponding to prefix P2 in Fig. 9. Note that, in reality, array indices and cell contents would contain integer router IDs.

By traversing the LP array starting at some cell (e.g. that for router c12), the learning path toward the corresponding router can be reconstructed, as shown by the right diagram in Fig. 10. Note that the LP array encodes a tree and that the external router from which c5 learnt the route (e2) can be derived from the Ride field of the routeset. As cells only contain the IDs of local routers, ⌈log2(N)⌉ bits suffice for each.
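A traversal of the LP array can be sketched as follows. The text does not say what the cell of the eBGP learner (the root of the tree) holds; the sketch assumes it holds the router's own ID, which keeps cell values within the N local router IDs. The function names are hypothetical and router IDs are assumed to be integers in 0..N-1.

```python
from typing import List, Sequence


def bits_per_cell(n_routers: int) -> int:
    """Bits needed per LP cell, i.e. ceil(log2(N)), with a minimum of 1."""
    return max(1, (n_routers - 1).bit_length())


def path_from_lp(lp: Sequence[int], router_id: int) -> List[int]:
    """Follow predecessor IDs from router_id up to the eBGP learner.

    Assumption: the root of the tree stores its own ID in its cell.
    """
    path = [router_id]
    current = router_id
    while lp[current] != current:
        current = lp[current]
        path.append(current)
        if len(path) > len(lp):
            raise ValueError("LP array contains a cycle")
    return path


# LP array for P2 in Fig. 9 with IDs c5=0, R3=1, R1=2, c1=3, c11=4, c12=5.
lp_p2 = [0, 0, 1, 2, 3, 3]
c12_path = path_from_lp(lp_p2, 5)
```

With this ID assignment, c12_path is 5, 3, 2, 1, 0, that is c12 ← c1 ← R1 ← R3 ← c5.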

LP arrays are stored separately from routesets, but in a manner distinct from that of AP arrays. We exploit the fact that


the learning paths (and trees) for many routes shall coincide, as they are limited by the number of routers having eBGP peers and the number of iBGP sessions, which, in RR setups, shall be much smaller than with a full iBGP mesh.

Instead of keeping an LP array per routeset, LP arrays are shared by many routesets. Each LP array has a reference count field, Rcnt, that counts the number of routesets that reference it. To quickly find LP arrays, these are stored in a hash-indexed POOL and thus also include a next field containing the virtual index of another array hashing to the same index. To reference LP arrays from routesets, an LP index vLP containing the virtual index of an LP array is stored in a separate POOL at the same index position as the routeset, as depicted in Fig. 11.

Figure 11: BGP routesets extended with LP arrays, shared among many routesets. The index of an LP array, vLP, is itself stored in another POOL, at the same virtual index as the routeset.

The above sharing of LP arrays is also the reason why external routers are not kept in the array. Since a border router could have several external peers, not including their IDs allows reusing the LP arrays for routes learnt from distinct external peers that get internally disseminated in the AS over the same sessions. Finally, note that the LP array corresponding to a routeset is not populated in one shot but rather gets updated as routers announce the routes matching a certain routeset. Thus, several instances of LPs are instantiated (or reused) until a routeset is fully updated. If Rcnt keeps track of the number of routesets that point to each LP array and these get freed whenever Rcnt reaches zero, it is easy to see that the number of LP arrays stored cannot exceed the number of routesets. Thus, in the worst case (where each routeset points to a distinct LP array), the amount of memory spent is that of the case where LPs were not shared, plus the sizes of the Rcnt, next and vLP fields (that would not be required in that case) and the two POOLs. However, we expect the set of sessions over which routes propagate (and hence the number of distinct propagation trees) to be much smaller than the number of routesets.
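The reference-counted sharing of LP arrays can be sketched as follows. This is an illustration only: Python dictionaries stand in for the hash-indexed POOL and for the parallel POOL of vLP indices, and the names LpPool and update_routeset_lp are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class LpPool:
    """Shares identical LP arrays, freeing them when Rcnt drops to zero."""

    def __init__(self) -> None:
        self._by_content: Dict[Tuple[int, ...], int] = {}
        self._slots: Dict[int, List] = {}   # vindex -> [content, Rcnt]
        self._next_vindex = 0

    def acquire(self, lp: Sequence[int]) -> int:
        """Return the vindex of an LP array with this content; Rcnt += 1."""
        key = tuple(lp)
        vindex = self._by_content.get(key)
        if vindex is None:
            vindex = self._next_vindex
            self._next_vindex += 1
            self._by_content[key] = vindex
            self._slots[vindex] = [key, 0]
        self._slots[vindex][1] += 1
        return vindex

    def release(self, vindex: int) -> None:
        """Drop one reference and free the array if none remain."""
        slot = self._slots[vindex]
        slot[1] -= 1
        if slot[1] == 0:
            del self._by_content[slot[0]]
            del self._slots[vindex]

    def __len__(self) -> int:
        return len(self._slots)


def update_routeset_lp(pool: LpPool, vlp_of: Dict[Hashable, int],
                       routeset: Hashable, new_lp: Sequence[int]) -> None:
    """Point a routeset at new_lp, acquiring before releasing the old array."""
    old = vlp_of.get(routeset)
    vlp_of[routeset] = pool.acquire(new_lp)
    if old is not None:
        pool.release(old)
```

Because every stored array is referenced by at least one routeset, the number of arrays in the pool can never exceed the number of routesets, which is the bound argued above.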

5.5. Performing LPM lookups and computing paths

With the above scheme, we perform mLPM lookups, conceptually, in two phases, with a lazy traversal that does not inspect all the routes hit until the end, as suggested in [3].

During the lookup, the destination address is split into strides s0, s1, s2, ... of length S (except, maybe, the last) using shift operations, so that the LSB of sk falls in the rightmost position of a 32-bit word, with weight 1. In phase 1, we traverse the B-node tree (starting at the root) using S address bits at a time (i.e. a stride; s ≤ S if fewer remain) until failing. At each B-node (at some level i) we use the current stride si to check the corresponding bitmap bit. If the stride is of length S, we check the bit corresponding to an external node (or internal if in the last level) and, if it is 1, we move to a child B-node. We store the stride used, si, and a pointer to the visited B-node, Bi, at each level in an array indexed by levels. As soon as a failure occurs (no child B-node exists), phase 2 begins, which “climbs up” the trie by inspecting the bits corresponding to internal nodes in the current B-node (using a stride one bit shorter at a time) until hitting a 1 (which signals the existence of a routeset). If no such bit exists, we keep on climbing up the tree, processing the bitmap of the parent B-node with the previous stride. This is why we store both in phase 1.

Determining which bitmap bits need to be tested when climbing up a B-node can be done very efficiently with shift operations. Recall that the virtual position vbit of a bit corresponding to a stride sk of length s is 2^s + sk. Suppose the starting stride length is S. Its vbit can be computed as vbit = (1 << S) + sk and that corresponding to a stride one bit shorter (for the parent unibit node), simply as vbit >> 1, since this yields 2^(S−1) plus the decimal value of a stride one bit shorter. Moreover, if a B-node points to no routeset, we set the extra bit, which allows us to skip the internal climb.
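The shift-based computation of virtual bit positions can be checked with a few lines of code. The generator below is a sketch with a hypothetical name; it yields the vbit of a stride and then that of each successively shorter stride, down to the empty stride at position 1.

```python
from typing import Iterator


def climbing_vbits(stride: int, length: int) -> Iterator[int]:
    """Yield vbit = 2**s + stride for s = length, length - 1, ..., 0.

    Each right shift drops the last stride bit and halves the 2**s marker,
    giving the position of the parent unibit node.
    """
    if not 0 <= stride < (1 << length):
        raise ValueError("stride does not fit in the given length")
    vbit = (1 << length) + stride
    while vbit >= 1:
        yield vbit
        vbit >>= 1


# Stride 101 (binary) of length 3: positions for 101, 10, 1 and the empty stride.
positions = list(climbing_vbits(0b101, 3))
```

For the stride 101 of length 3, the positions are 13 = 8 + 5, 6 = 4 + 2, 3 = 2 + 1 and 1 = 1 + 0. The first corresponds to the full-length stride tested in phase 1; the remaining ones are those visited during the internal climb of phase 2.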

Given the way routesets are defined, the first one encountered (or those it points to) may, in most cases, contain the LPM route for all routers, and the lookup shall end. Yet, this may not always be the case. The routesets found may not contain the routes for some of the routers, in which case phase 2 needs to continue climbing up. When we hit a routeset, we scan its N-array to determine if it contains routes for each router. To know when to stop climbing, we use a counter (initially at N) that we decrement for each router for which we find its LPM route, and an array of size N to know for which routers we need to check routes in each routeset. The jth cell of the above array points to the routesets containing the LPM route(s) for router j, similar to vector V[]. By inspecting the jth cell within the nhop pointed to by each routeset (containing the adjacencies a router uses to reach it), we derive a path as discussed.

6. Evaluation of the memory requirements of the RS

Our second implementation of the RS could not be tested in a production network. Next, we briefly discuss its memory requirements when evaluating it offline with data from publicly available BGP datasets. These contain BGP routes announced by several routers in major IXPs and ISPs. Since the data was obtained in setups that differ from those of a deployment of Nemo and some fields are missing, we made several assumptions on its interpretation. To interpret the data and the results, let us first make a broad theoretical breakdown of the memory required.

The memory required by the new RS depends on that occupied by the tree of B-nodes, linkage arrays and the referenced routesets. Recall that routesets do not store routing information themselves but virtual indices to nhops and attrsets for the sake of data reuse. attrsets store MED and LocPref values and point to AS_PATHS and COMMUNITIES (kept in hash tables). Since a routeset is keyed on three values (nhop index Vn, attrset index Va and Ride, storing the ID of the router learning a route in eBGP), the


number of routesets depends on the number of distinct combinations of the three fields. Similarly, the number of attrsets depends on the number of distinct AS_PATH, COMMUNITIES, MED and LocPref combinations in routes.

The total memory occupied by the RS is

Mem = B + Lkg + Aset + Path + Comm + Rsets + Lps

with B the memory occupied by B-nodes, Lkg by linkage arrays, Aset by attrsets, Path and Comm by AS_PATHS and COMMUNITIES, Rsets by routesets and Lps by LP (learning path) arrays. Each of the summands equals the number of entities Ni times their size Si (average if variably sized). Thus, omitting structs of marginal size (like pools), router adjacency tables and hash indices (whose size may depend little on the number of routes or routers), the memory spent by the RS is

\[
Mem = N_{Bnodes} S_{Bnode} + N_{Lkg} S_{lkg} + N_{asets} S_{aset} + N_{apaths} S_{apath} + N_{comm} S_{comm} + N_{hops} S_{nhop}(N) + N_{Lpaths} S_{Lpath}(N) + N_{rsets} S_{rset}(N)
\]

where, except for linkage arrays, AS_PATHS and COMMUNITIES, all items are of constant size, and none depends on the number of routers N except for routesets (e.g. due to InUse bits if BGP), nhops (due to resolution data per router) and LP arrays.

In the simplest setup (without add_paths), the size of each BGP routeset is S_rset(N) = F + N/8 bytes, with F the size of its fixed part (cf. Fig. 7). If add_paths is used, routesets are virtually enlarged with one byte per router and are of size S_rset(N) = F + N/8 + N (cf. Fig. 8). Thus, the total memory required is

\[
Mem = M_F + N_{rsets}\,(F + N/8) + (N_{rsets}\,N) \tag{1}
\]

where MF = B + Lkg + Aset + Path + Comm + Lps. Note that, since the number of distinct LP arrays shall be small and depend little on the number of routers, MF + Nrsets F can be understood as the minimal memory required to keep eBGP routing information, Nrsets N/8 as the memory needed to keep iBGP routes, and Nrsets N as the extra memory needed in case routers report all their BGP routes aside from their best (when enabling add_paths).
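Eq. (1) can be evaluated directly, as in the short sketch below. The function and parameter names are hypothetical, sizes are in bytes, and the numbers used in the example are illustrative values rather than measurements.

```python
def rs_memory_bytes(m_fixed: float, n_rsets: int, fixed_part: int,
                    n_routers: int, add_paths: bool = False) -> float:
    """Evaluate eq. (1): MF + Nrsets * (F + N/8) [+ Nrsets * N]."""
    mem = m_fixed + n_rsets * (fixed_part + n_routers / 8)
    if add_paths:
        # One extra byte per router and routeset for the AP arrays.
        mem += n_rsets * n_routers
    return mem


# Illustrative values only: MF = 1000 bytes, 10 routesets, F = 16, N = 80.
plain = rs_memory_bytes(1000, 10, 16, 80)
extended = rs_memory_bytes(1000, 10, 16, 80, add_paths=True)
```

With these illustrative values, the plain case needs 1000 + 10 · (16 + 10) = 1260 bytes and the add_paths case 800 bytes more, which shows that the AP term grows eight times faster in N than the InUse bitmaps do.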

6.1. Datasets, assumptions and results

Table 2 shows statistical data for the 9 datasets that we used to evaluate the memory requirements of the RS. The first five correspond to routes learnt by RIPE's route collectors. The last four, to routes announced by routers in four major IXPs. Since datasets contain only BGP information, our evaluation does not include the memory required to store IGP routes. However, in practice, such data may occupy a minuscule fraction of the total.

The first 10 columns show statistics of the data in each dataset, independent of their storage, like the number of prefixes (Npref), number of distinct next-hops (Nnhop), AS_PATHS (NPATH), COMMUNITIES (NCOM), or MED values (NMED) observed. Column from stores the number of distinct router IDs reporting routes (and NAS the number of distinct ASes they pertain to). In all datasets, the routes were reported in eBGP. Except in rrc02, the number of prefixes was around 500K. Since the from routers announced distinct routes to each prefix, the datasets contain multiple routes to each prefix and the total number of routes (column Nroutes) amounted to several millions. In this regard, column routes/prefix shows the average number of routes per prefix in each dataset, which ranged between 5.8 (rrc02) and 35.1 (OIX). It does not equal from because not all routers announced one route to every BGP prefix.

We fed the RS with these datasets and measured the amount of memory spent by the RS as a function of the number of local routers N. To give some intuition on the results, we assumed that each of the from routers corresponded to an egress point in a network being monitored by Nemo, and measured the resulting memory when increasing N, which included the from egress routers (eBGP speakers) and N − from internal routers that would only report iBGP routes toward Nemo.

The middle columns in Table 2 show some figures on the number of items instantiated in the RS. Specifically, the number of B-nodes in the bitmap trie (NBnodes) and of attrsets (NaSets). The number of AS_PATHS and COMMUNITIES stored was that indicated in the leftmost columns already discussed. Similarly, given the way routesets are keyed, their number Nrsets matched the total number of routes Nroutes (several millions) and is not shown. Note how on the order of a hundred thousand B-nodes were stored, and up to a few million attrsets. In this regard, routes/aSets shows the number of routesets divided by the number of attrsets. This ratio ranged from 4.66 to 7.17 across datasets and is a measure of the extent to which data is reused: on average, each attrset was used by roughly 5 to 7 routesets.

The rightmost columns show a breakdown of the memory spent. The B-node trees required only about 2MB (Btree) despite the number of B-nodes, plus circa 2.5MB of linkage data (Lkg), for a total of about 5MB. The stride length was s = 5 bits. The rest of the columns show the amount of memory occupied by attrsets (ASet), AS_PATHS (Path) and COMMUNITIES (Com). Due to their small number, nhops occupied an insignificant amount of memory, which is not shown. Column MF shows the sum of the preceding memory columns and thus indicates the amount of memory required to store prefixes and BGP route attributes. Note how, despite the huge number of routes, the value of MF is reasonably small. Note that MF does not include the amount of memory occupied by routesets. As in eq. (1), the total amount of memory is MF plus that occupied by the Nrsets routesets, which depends on the size of each routeset, in turn depending on N and on whether a single best route is stored per router or multiple ones with add_paths.

To see the impact of routeset size and the effect of N, the total memory occupied is shown in Fig. 12. The leftmost plot shows the memory expenditure with plain routesets, which include an N-bit InUse bitmap. Note how, despite the large number of routesets, the memory spent is quite small, requiring less than 1.6GB for N = 500 in dataset oix. When AP arrays are considered, however, the memory spent is significantly higher. This is because our design targets the worst case where all routers report all routes. Thus, the AP arrays allocate one byte per router for each routeset. This is one byte per routeset for every router, in addition to the per-router share of a plain routeset (F/N bytes plus 1 InUse bit). Thus, in oix (with 18.58 Mi routes, cf. Table 2), this is roughly 18.6MB per router. But, despite the large number of routes (e.g. 18 million in oix), about 4GB of storage would suffice to keep all the routes if reported by 200 routers.


Column groups: Dataset features, independent of storage (RIB to routes/prefix); number of items (NBnodes to routes/aSets); memory breakdown in MBytes (Btree to MF).

RIB    Npref  from  NAS  Nnhop  NPATH     NCOM   NMED  Nroutes   routes/prefix  NBnodes  NaSets    routes/aSets  Btree  Lkg   ASet  Path  Com   MF
ripe1  488K   53    50   528    1.11 Mi   47.8K  143   8.13 Mi   16.6           135.2K   1.55 Mi   5.24          2.06   2.37  35.5  33.0  2.22  75.2
rrc00  534K   15    15   16     1.10 Mi   36.5K  180   7.59 Mi   14.2           145.6K   1.42 Mi   5.33          2.22   2.59  32.6  34.4  1.84  73.7
rrc01  510K   49    48   471    807 K     51.6K  160   5.63 Mi   11.0           136.6K   1.01 Mi   5.58          2.08   2.46  23.1  23.9  2.72  54.3
rrc02  272K   36    36   69     251.6 K   5.3K   23    1.60 Mi   5.8            83.2K    343.53 K  4.66          1.26   1.35  7.8   7.6   0.31  18.4
rrc03  511K   75    67   814    552.9 K   21.8K  107   3.86 Mi   7.5            137.5K   615.50 K  6.28          2.09   2.47  14.0  16.5  0.94  36.2
OIX    527K   41    38   41     2.50 Mi   N/A    498   18.58 Mi  35.1           142.9K   2.59 Mi   7.17          2.18   2.55  59.3  73.9  N/A   137.9
LINX   517K   39    30   688    1.61 Mi   53.3K  300   13.63 Mi  26.3           139.9K   2.10 Mi   6.47          2.13   2.50  48.1  47.1  2.46  102.4
EQIX   511K   19    18   174    933.3 K   29.1K  240   7.03 Mi   13.7           137.1K   1.07 Mi   6.58          2.09   2.47  24.4  27.5  1.34  57.9
PAIX   520K   12    12   79     717.6 K   14.8K  179   5.04 Mi   9.6            139.5K   859.64 K  5.87          2.12   2.51  19.6  21.1  0.62  46.0

Table 2: Some features of the datasets (RIB dumps) used in our preliminary evaluation of the scalability of the RS, and amount of memory occupied by items whose size does not depend on the number of local routers N. The number of routesets allocated matched the number of routes (Nroutes) in all cases.

[Figure 12 plots. Left panel: "Total Memory in MB (only best routes)", y-axis MBytes (0 to 1600). Right panel: "Total Memory (in MB) when storing AP arrays (w/ add-paths)", y-axis MBytes (0 to 12000). x-axis in both panels: Number of routers N (0 to 500). Series: ripe1, rrc00, rrc01, rrc02, rrc03, oix, linx, eqix, paix.]

Figure 12: Total memory spent by the RS as a function of N when only storing 1 route per router (i.e. plain routesets, with N-bit InUse bitmaps) (left) and when also keeping AP arrays (needed to store pathIDs), on the right.

[Figure 13 plots. Left panel: "Total Memory per router (only best routes)", y-axis MBytes / router (logscale, 0.1 to 100). Right panel: "Total Memory per router when storing AP arrays (w/ add-paths)", y-axis MBytes / router (logscale, 1 to 100). x-axis in both panels: Number of routers N (0 to 500). Series: ripe1, rrc00, rrc01, rrc02, rrc03, oix, linx, eqix, paix.]

Figure 13: Memory per router with plain routesets (left) and with AP-extended routesets (right).

To see the impact on memory of storing AP arrays, Fig. 14 shows the memory spent relative to the case without them. This can be understood as the factor by which memory increases due to the need to store AP arrays, when routers report all their BGP routes. Even if all routers reported the average number of routes per prefix (column routes/prefix; 35 in oix, cf. Table 2), this factor is ≈ 6 for N = 200.

Fig. 13 shows the total memory over the number of routers N in both cases. Note that the memory per router decreases in N (because we are assuming that the number of routes, i.e. the number of egress nodes, is constant), which may not reflect reality for very large N. The point is that, even if add_paths were enabled, the memory per router becomes much smaller than MF, which is, roughly, the memory that would be required to store the routes for every router separately in case they reported all the routes.

From these results, we envisage that the RS may easily keep R&F data for several hundreds of nodes.

[Figure 14 plot: "Total Memory with add-paths relative to that without it", y-axis from 1 to 8, x-axis Number of routers N (0 to 500). Series: ripe1, rrc00, rrc01, rrc02, rrc03, oix, linx, eqix, paix.]

Figure 14: Memory with AP arrays relative to memory without, i.e. the factor by which memory increases due to storing AP arrays.

References

[1] Cisco Systems, BGP Route-Map Continue, URL http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/iproute_bgp/configuration/15-mt/irg-15-mt-book/irg-route-map-continue.html, 2014.

[2] M. Degermark, A. Brodnik, S. Carlsson, S. Pink, Small Forwarding Tables for Fast Routing Lookups, SIGCOMM Comput. Commun. Rev. 27 (4) (1997) 3–14, ISSN 0146-4833, doi:10.1145/263109.263133, URL http://doi.acm.org/10.1145/263109.263133.

[3] W. Eatherton, G. Varghese, Z. Dittia, Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates, SIGCOMM Comput. Commun. Rev. 34 (2) (2004) 97–122, ISSN 0146-4833.