Router internals — 1
Router and switch architecture
Martin Heusse
Router internals — 2
Contents
• Router architecture
• Routing table data structure
Router internals — 3
What is routing?
• Packet reception
✓ Interface FIFO (ring buffer?) holds groups of bits as they arrive
✓ Packet queued until treated by central CPU or interface card CPU (throw interrupt)
✓ Check CRC, is there space in memory…
✓ Packet classification (Dropped? Accepted? Switching method?)
✓ Moved to input hold queue
[Diagram: Interface → Int. FIFO / ring buffer (sometimes) → Classify → Int. queue → Packet routing]
• Packet forwarding
✓ Look up routing table
✓ Rewrite header (Ethernet, NAT?, TTL, checksum…)
✓ Packet moved to output hold queue
Router internals — 4
Input and Output queues
• Input queues absorb transient forwarding subsystem saturation (configurable)
• Output queues hold bursts of packets directed to one interface
• Generally, queues hold a given number of packets (not bytes)
How would you implement a queue? Ring? Chained list? What is the storage unit (MTU-size bin, packet, particle…)?
✓ There can be several queues in parallel (various priorities)
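One way to make the implementation question concrete: a minimal sketch (names `PacketRing`/`enqueue` are illustrative, not from any router OS) of a queue that holds a fixed number of packets, backed by a ring buffer with tail drop when full.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "data")

class PacketRing:
    """Fixed-capacity FIFO of packets backed by a ring buffer.

    Capacity is counted in packets, not bytes, matching the slide:
    a burst of small packets fills the queue as fast as large ones.
    """
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # index of the oldest packet
        self.count = 0         # packets currently queued

    def enqueue(self, pkt):
        if self.count == len(self.buf):
            return False       # tail drop: queue full, packet lost
        self.buf[(self.head + self.count) % len(self.buf)] = pkt
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        pkt = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return pkt

q = PacketRing(2)
assert q.enqueue(Packet(b"p1")) and q.enqueue(Packet(b"p2"))
assert not q.enqueue(Packet(b"p3"))   # third packet is tail-dropped
assert q.dequeue().data == b"p1"      # FIFO order preserved
```

A chained list of particles would instead trade this fixed footprint for flexible storage of variable-size packets.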
Router internals — 5
Shared memory — first generation
• Ex.: conventional PC, Cisco 2800, HP ProCurve 7xxx
• Everything stored in same memory space
[Diagram: interfaces and CPU on a single bus to the shared memory; separate CPU memory holds the routing table]
• Limiting factor: memory access
Router internals — 6
Shared memory — first generation (cont.)
Cisco 25xx (1993)
[Figures, from "Cisco Router Architecture" (presentation 601/1094_06F9_c4, © 1998/1999 Cisco Systems): block diagrams of the Cisco 160x series (M68360 CPU with SCC, CPU and I/O buses, on-board DRAM and DRAM SIMM, Boot ROM, NVRAM, PCMCIA, WIC slot, serial/CSU-DSU/BRI S/T, U/Ethernet/console interfaces) and of the Cisco 25xx series (M68030 CPU, Sys Ctrl ASIC, CPU and system buses, on-board DRAM and DRAM SIMM, Boot ROM, Flash, NVRAM, PCMCIA, dual UART, WIC slots on the 2524/2525, hub ports on the 2505/2507/2516, async lines on the 2509-2512, Ether/TR management card on the 2517-2519)]
Router internals — 7
Shared memory — first generation (cont.)
Cisco 7200
[Figures, from the same Cisco presentation: the Cisco 720x series (NPE with R4700/R5000 CPU, GT64010 system controller, SRAM, DRAM, Layer 2 cache on the NPE-200, CPU bus, PCI bridges to PCI buses 0/1/2 serving port adapters PA 1 to PA 6, midplane, I/O controller with Boot ROM, Boot Flash, NVRAM, EEPROM, PCMCIA, dual UART and Fast Ethernet) and the Cisco 70x0 RP and SP/SSP (M68040 on the RP, bit-slice processor on the SP/SSP, SRAM/DRAM, Multibus and CxBus interface logic, DMA logic, CxBus arbiter, diag bus to the interface processors, fan tray)]
Router internals — 8
Shared memory — first generation (cont.)
Juniper M40
• Decoupling of control plane and forwarding plane—forwarding by a dedicated ASIC
• 1998 — 40Gb/s
• JunOS based on FreeBSD
Router internals — 9
Shared memory — first generation (cont.)
PIC: Physical Interface Card
Router internals — 10
Intelligent line cards — 2nd generation
• Ex.: Cisco 7500
• Line cards have some intelligence, write into each other's memory
[Diagram: several line-card CPUs and the central CPU on one shared bus; CPU memory attached to the central CPU]
• Limiting factor: 1 shared bus (needs to be N times faster than each of N interfaces)
• Central processor dedicated only to control plane (distinct from forwarding plane)
Router internals — 11
Intelligent line cards — 2nd generation (cont.)
Cisco 7500
[Figures, from the same Cisco presentation: the Cisco 70x0 RSP7000 (R4600 CPU, Sys Ctrl ASICs, QA ASIC, SRAM, DRAM, MemD Ctrl ASICs, register FPGA, CxBus arbiter, diag bus to the IP/VIP cards, CI board, environment logic, EEPROM, fan tray, Boot ROM, Boot Flash, NVRAM, PCMCIA, dual UART) and the Cisco 75xx series RSP (R4600/R4700/R5000 CPU, Layer 2 cache, Sys Ctrl ASICs, QA ASIC, SRAM, DRAM, MemD Ctrl ASICs, dual CyBus 0/1 with CyBus arbiter, diag bus to the IP/VIP cards, Boot ROM, Boot Flash, NVRAM, PCMCIA, dual UART)]
Router internals — 12
Intelligent line cards — 2nd generation (cont.)
Versatile Interface Processors (1 per interface)
[Figure, from the same Cisco presentation: Cisco 75xx series VIP block diagram; R4600/R4700/R5000 CPU, CPU bus, Layer 2 cache, Boot ROM, EEPROM, DRAM with DRAM Ctrl ASICs, SRAM, PCI bridges 1/2 and PCI buses 0/1/2 to the PA ports, PMA and CYA ASICs, I/O Ctrl ASIC, diag bus, CBus/packet bus]

High End Router Comparison

        PROCESSOR  CPU          TYPE  CPU BUS  INTERFACES      L2 CACHE  CLOCK
7500    RSP1       R4600        RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
        RSP2       R4600/R4700  RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
        RSP4       R5000        RISC  64 bit   IP, VIP1, VIP2  512 KB    200 MHz
        VIP2-15    R4700        RISC  64 bit   PA              512 KB    100 MHz
        VIP2-40    R4700        RISC  64 bit   PA              512 KB    100 MHz
        VIP2-50    R4700        RISC  64 bit   PA              512 KB    200 MHz
7200    NPE100     R4700        RISC  64 bit   PA, IO-FE       512 KB    150 MHz
        NPE150     R4700        RISC  64 bit   PA, IO-FE       512 KB    150 MHz
        NPE200     R5000        RISC  64 bit   PA, IO-FE       512 KB    200 MHz
7000    RP         M68040       RISC  32 bit   IP, VIP1        -         40 MHz
        RSP7K      R4600        RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
Router internals — 13
Intelligent line cards + crossbar switch — 3rd generation
• Ex.: Cisco 7600, Juniper T-series, HP ProCurve Switch 4200vl…
• Crossbar switch:
[Diagram: line-card CPUs on both sides of a crossbar connecting any input to any output]
• Routing of N simultaneous packets (or cells)
Router internals — 14
Head of line blocking
• Crossbar needs to be N times faster than each line, or needs one buffer per output on each input (i.e. one buffer per crosspoint)
• What goes through the crossbar?
✓ ATM cells
✓ particles? (→ packet reassembly)
✓ packets
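A toy simulation can illustrate why head-of-line blocking matters and what per-output buffers on each input (virtual output queues) buy. This is a sketch under stated assumptions (uniform random traffic, one arrival per input per slot, a simple greedy matching; `simulate` and its parameters are ours, not any vendor's scheduler):

```python
import random

def simulate(n_slots, voq, n_ports=4, seed=1):
    """Fraction of offered load delivered through an n_ports crossbar.

    Each slot, every input receives one packet for a random output and
    each output can accept at most one packet. With FIFO input queues
    (voq=False) only the head packet is eligible, so a blocked head
    stalls the packets behind it (HOL blocking). With virtual output
    queues (voq=True) each input keeps one queue per output.
    """
    rng = random.Random(seed)
    if voq:
        queues = [[0] * n_ports for _ in range(n_ports)]  # [input][output] counts
    else:
        queues = [[] for _ in range(n_ports)]             # one FIFO per input
    delivered = 0
    for _ in range(n_slots):
        for i in range(n_ports):                          # one arrival per input
            dst = rng.randrange(n_ports)
            if voq:
                queues[i][dst] += 1
            else:
                queues[i].append(dst)
        busy = set()                                      # outputs used this slot
        for i in range(n_ports):
            if voq:
                for dst in range(n_ports):
                    if queues[i][dst] and dst not in busy:
                        queues[i][dst] -= 1
                        busy.add(dst)
                        delivered += 1
                        break                             # one packet per input per slot
            elif queues[i] and queues[i][0] not in busy:
                busy.add(queues[i].pop(0))                # head can go; otherwise stall
                delivered += 1
    return delivered / (n_slots * n_ports)

print(simulate(10000, voq=False))  # FIFO inputs: throughput limited by HOL blocking
print(simulate(10000, voq=True))   # VOQs: noticeably higher throughput
```

The FIFO case saturates well below full load because a head packet waiting for a busy output blocks packets behind it that could have used idle outputs.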
Router internals — 15
Cisco router performance (packets/s)
Router          Process switching   Fast switching
2500            800                 4,400
2801            3,000               90,000
7200-NPE-G1     79,000              1,018,000
7600-dCEF720    48,000,000 per slot
Router internals — 16
A step further
• Check Cisco CEF (Cisco Express Forwarding)
• Banyan switch
• MPLS: packets carry an identifier of their processing
Router internals — 17
Routing table
• Static entries, routing protocols, ARP
• Can be large!
• Entries in use are cached (on interface cards, if applicable) → the cache holds a small subset of known destinations
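The caching idea can be sketched with a tiny LRU cache in front of the full table lookup (a sketch only: real route caches key on prefixes or hashed flows, and `RouteCache`/`next_hop` are illustrative names):

```python
from collections import OrderedDict

class RouteCache:
    """LRU cache of recently used destinations in front of a slow lookup."""
    def __init__(self, capacity, full_lookup):
        self.capacity = capacity
        self.full_lookup = full_lookup   # slow path: full routing-table search
        self.cache = OrderedDict()       # destination -> next hop

    def next_hop(self, dst):
        if dst in self.cache:
            self.cache.move_to_end(dst)  # hit: refresh recency
            return self.cache[dst]
        hop = self.full_lookup(dst)      # miss: fall back to the full table
        self.cache[dst] = hop
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used entry
        return hop

# Hypothetical flat table standing in for the real routing table.
table = {"10.0.0.1": "if0", "10.0.0.2": "if1", "10.0.0.3": "if2"}
rc = RouteCache(2, table.get)
assert rc.next_hop("10.0.0.1") == "if0"
assert rc.next_hop("10.0.0.2") == "if1"
rc.next_hop("10.0.0.3")               # capacity 2: evicts 10.0.0.1
assert "10.0.0.1" not in rc.cache
```

The point of the small capacity is exactly the slide's: the cache covers the working set of destinations, not the whole table.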
Router internals — 18
Longest match lookup — Routing table storage
• Source: M. A. Ruiz-Sanchez, E. W. Biersack, W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Network, vol. 15, no. 2, pp. 8-23, Mar/Apr 2001
2 Classical Solution
2.1 Binary Trie
A natural way to represent prefixes is using a trie. A trie is a tree-based data structure allowing the organization
of prefixes on a digital basis by using the bits of prefixes to direct the branching. Figure 7 shows a binary trie
(each node has at most two children) representing a set of prefixes of a forwarding table.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 7: Binary trie for this set of prefixes. Edges are labeled 0 (left) and 1 (right); the nodes storing prefixes a to i are drawn dark.]
In a trie, a node on level l represents the set of all addresses that begin with the sequence of l bits consisting
of the string of bits labeling the path from the root to that node. For example, node c in figure 7 is at level
3 and represents all addresses beginning with the sequence 011. The nodes that correspond to prefixes are
shown in dark color and these nodes will contain the forwarding information or a pointer to the forwarding
information. Note also that prefixes are not only located at leaves but also at some internal nodes. This situation
arises because of exceptions in the aggregation process. For example, in figure 7 the prefixes b and c represent
exceptions to prefix a. Figure 8 illustrates this situation better. The trie shows the total address space, assuming
5-bit long addresses. Each leaf represents one possible address. We can see that address spaces covered by
prefixes b and c overlap with the address space covered by prefix a. Thus, prefixes b and c represent exceptions
to prefix a and refer to specific subintervals of the address interval covered by prefix a. In the trie in figure 7,
this is reflected by the fact that prefixes b and c are descendants of prefix a, or in other words, prefix a is itself
a prefix of b and c. As a result, some addresses will match several prefixes. For example, addresses beginning
with 011 will match both, prefix c and prefix a. Nevertheless, prefix c must be preferred because it is more
specific (longest match rule).
Tries allow in a straightforward way to find the longest prefix that matches a given destination address. The
search in a trie is guided by the bits of the destination address. At each node, the search proceeds to the left or
to the right according to the sequential inspection of the address bits. While traversing the trie, every time we
visit a node marked as prefix (i.e., a dark node) we remember this prefix as the longest match found so far. The
search ends when there is no more branch to take, and the longest or best matching prefix will be the last prefix
remembered. For instance, if we search the best matching prefix (BMP) for an address beginning with the bit
pattern 10110 we start at the root in figure 7. Since the first bit of the address is 1 we move to the right, to the
node marked with prefix d and we remember d as the BMP found so far. Then we move to the left since the
second address bit is 0, this time the node is not marked as prefix, so d is still the BMP found so far. Next the
third address bit is 1 but at this point there is no branch labeled 1, so the search ends and the last remembered BMP (prefix d) is the correct one.
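The lookup procedure described above translates almost line by line into code. A minimal sketch (names `insert`/`lookup` are ours) using the prefixes of Figure 7:

```python
class Node:
    def __init__(self):
        self.children = {}   # '0' or '1' -> Node
        self.prefix = None   # label if this node stores a prefix

def insert(root, bits, label):
    """Walk the bits of the prefix, creating nodes as needed."""
    node = root
    for b in bits:
        node = node.children.setdefault(b, Node())
    node.prefix = label

def lookup(root, addr_bits):
    """Longest-prefix match: remember the last prefix node visited."""
    best, node = None, root
    for b in addr_bits:
        if node.prefix is not None:
            best = node.prefix       # dark node: BMP found so far
        if b not in node.children:
            return best              # no branch to take: search ends
        node = node.children[b]
    return node.prefix if node.prefix is not None else best

# Prefixes from Figure 7
root = Node()
for bits, label in [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"),
                    ("100", "e"), ("1100", "f"), ("1101", "g"),
                    ("1110", "h"), ("1111", "i")]:
    insert(root, bits, label)

assert lookup(root, "10110") == "d"   # the example traced in the text
assert lookup(root, "01100") == "c"   # c preferred over a (more specific)
```

As the survey notes, the cost is one node visit per address bit, up to 32 memory accesses for IPv4.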
Router internals — 19
Path-compressed trie
…by the bit-number field in the nodes traversed. When a node marked as prefix is encountered, a comparison with
the actual prefix value is performed. This is necessary since during the descent in the trie we may skip some
bits. If a match is found, we proceed traversing the trie and keep the prefix as the BMP so far. Search ends when
a leaf is encountered or a mismatch is found. As usual the BMP will be the last matching prefix encountered.
For instance, if we look for the BMP of an address beginning with the bit pattern 010110 in the path compressed
trie shown in figure 9, we proceed as follows: We start at the root node and since its bit number is 1 we inspect
the first bit of the address. The first bit is 0 so we go to the left. Since the node is marked as prefix we compare
the prefix a with the corresponding part of the address (0). Since they match we proceed and keep a as the BMP
so far. Since the node’s bit number is 3 we skip the second bit of the address and inspect the third one. This bit
is 0 so we go to the left. Again we check whether the prefix b matches the corresponding part of the address
(01011). Since they do not match, search stops and the last remembered BMP (prefix a) is the correct BMP.
Path-compression was first proposed in a scheme called PATRICIA [10], but this scheme does not support
longest prefix matching. Sklower proposed a scheme with modifications for longest prefix matching in [13]. In
fact, this variant was originally designed not only to support prefixes but more general non-contiguous masks.
Since this feature was never really used, current implementations differ somewhat from Sklower's original
scheme. For example, the BSD version of the path-compressed trie (referred to as BSD trie) is essentially the
same as we have just described. The basic difference is that in the BSD scheme, the trie is first traversed without
checking the prefixes at internal nodes. Once at a leaf, the traversed path is backtracked in search of the longest
matching prefix. At each node with a prefix, or a list of prefixes, a comparison is performed to check for a
match. Search ends when a match is found. Comparison operations are not made on the downward path in the
hope that not many exception prefixes exist. Note that with this scheme, in the worst case, the path is completely
traversed two times. In the case of the original Sklower’s scheme the backtrack phase also needs to do recursive
descents of the trie because non-contiguous masks are allowed.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 9: A path-compressed trie for the same prefixes. One-child chains are removed; each node carries the bit number (1 to 5) to inspect next.]
Until recently, the longest matching prefix problem has been addressed by using data structures based on
path-compressed tries, like the BSD trie. Path-compression makes much sense when the binary trie is sparsely
populated. But when the number of prefixes increases and the trie gets denser, using path compression has little
benefit. Moreover, the principal disadvantage of path-compressed tries, as well as binary tries in general, is that
a search needs to do many memory accesses, in the worst case 32 for IPv4 addresses. For example, for a typical
backbone router [18] with 47113 prefixes, the BSD version for a path-compressed trie creates 93304 nodes. The
maximal height is 26, while the average height is almost 20. For the same prefixes, a simple binary trie (with
one-child nodes) has a maximal height of 30 and an average height of almost 22. As we can see, the heights of
both tries are very similar and the BSD trie may perform additional comparison operations when backtracking
is needed.
• Useful for a sparsely populated space, but many prefixes are in use in IPv4
• Backtracking necessary: after reaching e and finding out that it does not match, need to go back to d (for 101…, for example)
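The remember-as-you-go variant described in the excerpt (not the BSD backtracking version) can be sketched as follows: build a binary trie, splice out one-child chains that carry no prefix, record at each node which bit it tests, and re-compare the stored prefix at every marked node because skipped bits were never inspected. Names (`build`, `compress`, `lookup`) are ours:

```python
class Node:
    def __init__(self, bit):
        self.bit = bit        # index of the address bit tested at this node
        self.kids = {}        # '0'/'1' -> Node
        self.prefix = None    # (bits, label) if a prefix is stored here

def build(prefixes):
    """Binary trie first, then path compression."""
    root = Node(0)
    for bits, label in prefixes:
        node = root
        for depth, b in enumerate(bits, start=1):
            node = node.kids.setdefault(b, Node(depth))
        node.prefix = (bits, label)
    compress(root)
    return root

def compress(node):
    for b in list(node.kids):
        child = node.kids[b]
        while child.prefix is None and len(child.kids) == 1:
            (child,) = child.kids.values()   # splice chain; child.bit keeps the depth
        node.kids[b] = child
        compress(child)

def lookup(root, addr):
    best, node = None, root
    while node is not None:
        if node.prefix is not None:
            bits, label = node.prefix
            if not addr.startswith(bits):    # skipped bits may mismatch: compare
                break                        # mismatch: search stops
            best = label                     # match: BMP found so far
        if node.bit >= len(addr):
            break
        node = node.kids.get(addr[node.bit])
    return best

prefixes = [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"), ("100", "e"),
            ("1100", "f"), ("1101", "g"), ("1110", "h"), ("1111", "i")]
root = build(prefixes)
assert lookup(root, "010110") == "a"   # the example worked in the excerpt
assert lookup(root, "10110") == "d"    # comparison fails at e, d remains the BMP
```

Because the best match so far is remembered on the way down, no explicit backtracking step is needed in this variant.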
Router internals — 20
Disjoint prefix trie
We have seen that prefixes can overlap (see figure 4). In a trie, when two prefixes overlap, one of them is
itself a prefix of the other, see figures 7 and 8. Since prefixes represent intervals of contiguous addresses, when
two prefixes overlap this means that one interval of addresses contains another interval of addresses, see figure
4 and 8. In fact, that is why an address can be matched to several prefixes. If several prefixes match, the longest
prefix match rule is used in order to find the most specific forwarding information. One way to avoid the use of
the longest prefix match rule and to still find the most specific forwarding information is to transform a given
set of prefixes into a set of disjoint prefixes. Disjoint prefixes do not overlap and thus no address prefix is itself
prefix of another one. A trie representing a set of disjoint prefixes will have prefixes at the leaves but not at
internal nodes. To obtain a disjoint-prefix binary trie, we simply add leaves to nodes that have only one child.
These new leaves are new prefixes that inherit the forwarding information of the closest ancestor marked as a
prefix. Finally, internal nodes marked as prefixes are unmarked. For example, figure 10 shows the disjoint-prefix
binary trie that corresponds to the trie in figure 7. Prefixes a′, a″ and a‴ have inherited the forwarding information
of the original prefix a, which has now been suppressed. Prefix d′ has been obtained in a similar way. Since
prefixes at internal nodes are expanded or pushed down to the leaves of the trie, this technique has been called
leaf pushing by Srinivasan et al. [14]. Figure 11 shows the disjoint intervals of addresses that correspond to the
disjoint-prefix binary trie of figure 10.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 10: Disjoint-prefix binary trie. The internal prefixes a and d are unmarked and pushed down to new leaves a′, a″, a‴ and d′.]
[Figure 11: Expanded disjoint-prefix binary trie, showing the disjoint intervals of addresses covered by a′, a″, a‴, b, c, d′, e, f, g, h and i.]
Compression techniques: Data compression tries to remove redundancy from the encoding. The idea of using compression comes from the fact that expanding the prefixes increases information redundancy.
• Disjoint prefixes do not overlap
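Leaf pushing as described in the excerpt can be sketched as a transformation of a binary trie: unmark internal prefixes, create the missing sibling at one-child nodes, and let every leaf inherit the closest marked ancestor's information. Afterwards the first (and only) label met on the search path is the answer. Names and construction are ours:

```python
class Node:
    def __init__(self):
        self.kids = {}       # '0'/'1' -> Node
        self.label = None    # forwarding info if this node is a prefix

def insert(root, bits, label):
    node = root
    for b in bits:
        node = node.kids.setdefault(b, Node())
    node.label = label

def leaf_push(node, inherited=None):
    """Make the prefix set disjoint: only leaves keep forwarding info."""
    if node.label is not None:
        inherited = node.label
        if node.kids:
            node.label = None            # unmark internal prefix nodes
    if not node.kids:
        node.label = inherited           # leaf inherits closest marked ancestor
        return
    if len(node.kids) == 1 and inherited is not None:
        missing = "1" if "0" in node.kids else "0"
        node.kids[missing] = Node()      # new leaf covers the uncovered branch
    for child in node.kids.values():
        leaf_push(child, inherited)

def lookup(root, addr):
    """Disjoint prefixes: the first label met is the answer, no
    longest-match bookkeeping or backtracking needed."""
    node = root
    for b in addr:
        if b not in node.kids:
            return None
        node = node.kids[b]
        if node.label is not None:
            return node.label
    return None

root = Node()
for bits, label in [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"),
                    ("100", "e"), ("1100", "f"), ("1101", "g"),
                    ("1110", "h"), ("1111", "i")]:
    insert(root, bits, label)
leaf_push(root)
assert lookup(root, "01100") == "c"
assert lookup(root, "10110") == "d"   # served by a pushed-down copy of d
```

The price of the simpler lookup is the extra leaves: as in figure 10, prefix a is replicated three times here.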
Router internals — 21
There are other techniques!
Router internals — 22
Sources
• S. Keshav; “An engineering approach to computer networking”
• Cisco Router Architecture, www.cisco.com/networkers/nw99_pres/601.pdf
• Ross & Kurose “Computer Networking”
• …