Router internals — 1
Router and switch architecture
Martin Heusse
Router internals — 2
Contents
• Router architecture
• Routing table data structure
Router internals — 3
What is routing?
• Packet reception
✓ Interface FIFO (ring buffer?) holds groups of bits as they arrive
✓ Packet queued until treated by central CPU or interface card CPU (throw interrupt)
✓ Check CRC, is there space in memory…
✓ Packet classification (Dropped? Accepted? Switching method?)
✓ Moved to input hold queue
[Diagram: Interface → Int. FIFO / ring buffer (sometimes) → Classify → Int. queue → Packet routing]
• Packet forwarding
✓ Look up routing table
✓ Rewrite header (Ethernet, NAT?, TTL, checksum…)
✓ Packet moved to output hold queue
Router internals — 4
Input and Output queues
• Input queues absorb transient forwarding subsystem saturation (configurable)
• Output queues hold bursts of packets directed to one interface
• Generally, queues hold a given number of packets (not bytes)
How would you implement a queue? Ring? Chained list? What is the storage unit (MTU-size bin, packet, particle…)?
✓ There can be several queues in parallel (various priorities)
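One way to make the implementation question concrete: a minimal sketch (names `PacketRing`/`enqueue` are illustrative, not from any router OS) of a queue that holds a fixed number of packets, backed by a ring buffer with tail drop when full.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "data")

class PacketRing:
    """Fixed-capacity FIFO of packets backed by a ring buffer.

    Capacity is counted in packets, not bytes, matching the slide:
    a burst of small packets fills the queue as fast as large ones.
    """
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # index of the oldest packet
        self.count = 0         # packets currently queued

    def enqueue(self, pkt):
        if self.count == len(self.buf):
            return False       # tail drop: queue full, packet lost
        self.buf[(self.head + self.count) % len(self.buf)] = pkt
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        pkt = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return pkt

q = PacketRing(2)
assert q.enqueue(Packet(b"p1")) and q.enqueue(Packet(b"p2"))
assert not q.enqueue(Packet(b"p3"))   # third packet is tail-dropped
assert q.dequeue().data == b"p1"      # FIFO order preserved
```

A chained list of particles would instead trade this fixed footprint for flexible storage of variable-size packets.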
Router internals — 5
Shared memory — first generation
• Ex.: conventional PC, Cisco 2800, HP ProCurve 7xxx
• Everything stored in same memory space
[Diagram: interfaces and CPU on a single bus to the shared memory; separate CPU memory holds the routing table]
• Limiting factor: memory access
Router internals — 6
Shared memory — first generation (cont.)
Cisco 25xx (1993)
[Figures, from "Cisco Router Architecture" (presentation 601/1094_06F9_c4, © 1998/1999 Cisco Systems): block diagrams of the Cisco 160x series (M68360 CPU with SCC, CPU and I/O buses, on-board DRAM and DRAM SIMM, Boot ROM, NVRAM, PCMCIA, WIC slot, serial/CSU-DSU/BRI S/T, U/Ethernet/console interfaces) and of the Cisco 25xx series (M68030 CPU, Sys Ctrl ASIC, CPU and system buses, on-board DRAM and DRAM SIMM, Boot ROM, Flash, NVRAM, PCMCIA, dual UART, WIC slots on the 2524/2525, hub ports on the 2505/2507/2516, async lines on the 2509-2512, Ether/TR management card on the 2517-2519)]
Router internals — 7
Shared memory — first generation (cont.)
Cisco 7200
[Figures, from the same Cisco presentation: the Cisco 720x series (NPE with R4700/R5000 CPU, GT64010 system controller, SRAM, DRAM, Layer 2 cache on the NPE-200, CPU bus, PCI bridges to PCI buses 0/1/2 serving port adapters PA 1 to PA 6, midplane, I/O controller with Boot ROM, Boot Flash, NVRAM, EEPROM, PCMCIA, dual UART and Fast Ethernet) and the Cisco 70x0 RP and SP/SSP (M68040 on the RP, bit-slice processor on the SP/SSP, SRAM/DRAM, Multibus and CxBus interface logic, DMA logic, CxBus arbiter, diag bus to the interface processors, fan tray)]
Router internals — 8
Shared memory — first generation (cont.)
Juniper M40
• Decoupling of control plane and forwarding plane—forwarding by a dedicated ASIC
• 1998 — 40Gb/s
• JunOS based on FreeBSD
Router internals — 9
Shared memory — first generation (cont.)
PIC: Physical Interface Card
Router internals — 10
Intelligent line cards — 2nd generation
• Ex.: Cisco 7500
• Line cards have some intelligence, write into each other's memory
[Diagram: several line-card CPUs and the central CPU on one shared bus; CPU memory attached to the central CPU]
• Limiting factor: 1 shared bus (needs to be N times faster than each of N interfaces)
• Central processor dedicated only to control plane (distinct from forwarding plane)
Router internals — 11
Intelligent line cards — 2nd generation (cont.)
Cisco 7500
[Figures, from the same Cisco presentation: the Cisco 70x0 RSP7000 (R4600 CPU, Sys Ctrl ASICs, QA ASIC, SRAM, DRAM, MemD Ctrl ASICs, register FPGA, CxBus arbiter, diag bus to the IP/VIP cards, CI board, environment logic, EEPROM, fan tray, Boot ROM, Boot Flash, NVRAM, PCMCIA, dual UART) and the Cisco 75xx series RSP (R4600/R4700/R5000 CPU, Layer 2 cache, Sys Ctrl ASICs, QA ASIC, SRAM, DRAM, MemD Ctrl ASICs, dual CyBus 0/1 with CyBus arbiter, diag bus to the IP/VIP cards, Boot ROM, Boot Flash, NVRAM, PCMCIA, dual UART)]
Router internals — 12
Intelligent line cards — 2nd generation (cont.)
Versatile Interface Processors (1 per interface)
[Figure, from the same Cisco presentation: Cisco 75xx series VIP block diagram; R4600/R4700/R5000 CPU, CPU bus, Layer 2 cache, Boot ROM, EEPROM, DRAM with DRAM Ctrl ASICs, SRAM, PCI bridges 1/2 and PCI buses 0/1/2 to the PA ports, PMA and CYA ASICs, I/O Ctrl ASIC, diag bus, CBus/packet bus]

High End Router Comparison

        PROCESSOR  CPU          TYPE  CPU BUS  INTERFACES      L2 CACHE  CLOCK
7500    RSP1       R4600        RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
        RSP2       R4600/R4700  RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
        RSP4       R5000        RISC  64 bit   IP, VIP1, VIP2  512 KB    200 MHz
        VIP2-15    R4700        RISC  64 bit   PA              512 KB    100 MHz
        VIP2-40    R4700        RISC  64 bit   PA              512 KB    100 MHz
        VIP2-50    R4700        RISC  64 bit   PA              512 KB    200 MHz
7200    NPE100     R4700        RISC  64 bit   PA, IO-FE       512 KB    150 MHz
        NPE150     R4700        RISC  64 bit   PA, IO-FE       512 KB    150 MHz
        NPE200     R5000        RISC  64 bit   PA, IO-FE       512 KB    200 MHz
7000    RP         M68040       RISC  32 bit   IP, VIP1        -         40 MHz
        RSP7K      R4600        RISC  64 bit   IP, VIP1, VIP2  -         100 MHz
Router internals — 13
Intelligent line cards + crossbar switch — 3rd generation
• Ex.: Cisco 7600, Juniper T-series, HP ProCurve Switch 4200vl…
• Crossbar switch:
[Diagram: line-card CPUs on both sides of a crossbar connecting any input to any output]
• Routing of N simultaneous packets (or cells)
Router internals — 14
Head of line blocking
• Crossbar needs to be N times faster than each line, or needs one buffer per output on each input (i.e. one buffer per crosspoint)
• What goes through the crossbar?
✓ ATM cells
✓ particles? (→ packet reassembly)
✓ packets
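A toy simulation can illustrate why head-of-line blocking matters and what per-output buffers on each input (virtual output queues) buy. This is a sketch under stated assumptions (uniform random traffic, one arrival per input per slot, a simple greedy matching; `simulate` and its parameters are ours, not any vendor's scheduler):

```python
import random

def simulate(n_slots, voq, n_ports=4, seed=1):
    """Fraction of offered load delivered through an n_ports crossbar.

    Each slot, every input receives one packet for a random output and
    each output can accept at most one packet. With FIFO input queues
    (voq=False) only the head packet is eligible, so a blocked head
    stalls the packets behind it (HOL blocking). With virtual output
    queues (voq=True) each input keeps one queue per output.
    """
    rng = random.Random(seed)
    if voq:
        queues = [[0] * n_ports for _ in range(n_ports)]  # [input][output] counts
    else:
        queues = [[] for _ in range(n_ports)]             # one FIFO per input
    delivered = 0
    for _ in range(n_slots):
        for i in range(n_ports):                          # one arrival per input
            dst = rng.randrange(n_ports)
            if voq:
                queues[i][dst] += 1
            else:
                queues[i].append(dst)
        busy = set()                                      # outputs used this slot
        for i in range(n_ports):
            if voq:
                for dst in range(n_ports):
                    if queues[i][dst] and dst not in busy:
                        queues[i][dst] -= 1
                        busy.add(dst)
                        delivered += 1
                        break                             # one packet per input per slot
            elif queues[i] and queues[i][0] not in busy:
                busy.add(queues[i].pop(0))                # head can go; otherwise stall
                delivered += 1
    return delivered / (n_slots * n_ports)

print(simulate(10000, voq=False))  # FIFO inputs: throughput limited by HOL blocking
print(simulate(10000, voq=True))   # VOQs: noticeably higher throughput
```

The FIFO case saturates well below full load because a head packet waiting for a busy output blocks packets behind it that could have used idle outputs.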
Router internals — 15
Cisco router performance (packets/s)
Router          Process switching   Fast switching
2500            800                 4,400
2801            3,000               90,000
7200-NPE-G1     79,000              1,018,000
7600-dCEF720    48,000,000 per slot
Router internals — 16
A step further
• Check Cisco CEF (Cisco Express Forwarding)
• Banyan switch
• MPLS: packets carry an identifier of their processing
Router internals — 17
Routing table
• Static entries, routing protocols, ARP
• Can be large!
• Entries in use are cached (on interface cards, if applicable) → the cache holds a small subset of known destinations
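The caching idea can be sketched with a tiny LRU cache in front of the full table lookup (a sketch only: real route caches key on prefixes or hashed flows, and `RouteCache`/`next_hop` are illustrative names):

```python
from collections import OrderedDict

class RouteCache:
    """LRU cache of recently used destinations in front of a slow lookup."""
    def __init__(self, capacity, full_lookup):
        self.capacity = capacity
        self.full_lookup = full_lookup   # slow path: full routing-table search
        self.cache = OrderedDict()       # destination -> next hop

    def next_hop(self, dst):
        if dst in self.cache:
            self.cache.move_to_end(dst)  # hit: refresh recency
            return self.cache[dst]
        hop = self.full_lookup(dst)      # miss: fall back to the full table
        self.cache[dst] = hop
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used entry
        return hop

# Hypothetical flat table standing in for the real routing table.
table = {"10.0.0.1": "if0", "10.0.0.2": "if1", "10.0.0.3": "if2"}
rc = RouteCache(2, table.get)
assert rc.next_hop("10.0.0.1") == "if0"
assert rc.next_hop("10.0.0.2") == "if1"
rc.next_hop("10.0.0.3")               # capacity 2: evicts 10.0.0.1
assert "10.0.0.1" not in rc.cache
```

The point of the small capacity is exactly the slide's: the cache covers the working set of destinations, not the whole table.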
Router internals — 18
Longest match lookup — Routing table storage
• Source: M. A. Ruiz-Sanchez, E. W. Biersack, W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Network, vol. 15, no. 2, pp. 8-23, Mar/Apr 2001
2 Classical Solution
2.1 Binary Trie
A natural way to represent prefixes is using a trie. A trie is a tree-based data structure allowing the organization
of prefixes on a digital basis by using the bits of prefixes to direct the branching. Figure 7 shows a binary trie
(each node has at most two children) representing a set of prefixes of a forwarding table.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 7: Binary trie for this set of prefixes. Edges are labeled 0 (left) and 1 (right); the nodes storing prefixes a to i are drawn dark.]
In a trie, a node on level l represents the set of all addresses that begin with the sequence of l bits consisting
of the string of bits labeling the path from the root to that node. For example, node c in figure 7 is at level
3 and represents all addresses beginning with the sequence 011. The nodes that correspond to prefixes are
shown in dark color and these nodes will contain the forwarding information or a pointer to the forwarding
information. Note also that prefixes are not only located at leaves but also at some internal nodes. This situation
arises because of exceptions in the aggregation process. For example, in figure 7 the prefixes b and c represent
exceptions to prefix a. Figure 8 illustrates this situation better. The trie shows the total address space, assuming
5-bit long addresses. Each leaf represents one possible address. We can see that address spaces covered by
prefixes b and c overlap with the address space covered by prefix a. Thus, prefixes b and c represent exceptions
to prefix a and refer to specific subintervals of the address interval covered by prefix a. In the trie in figure 7,
this is reflected by the fact that prefixes b and c are descendants of prefix a, or in other words, prefix a is itself
a prefix of b and c. As a result, some addresses will match several prefixes. For example, addresses beginning
with 011 will match both, prefix c and prefix a. Nevertheless, prefix c must be preferred because it is more
specific (longest match rule).
Tries allow in a straightforward way to find the longest prefix that matches a given destination address. The
search in a trie is guided by the bits of the destination address. At each node, the search proceeds to the left or
to the right according to the sequential inspection of the address bits. While traversing the trie, every time we
visit a node marked as prefix (i.e., a dark node) we remember this prefix as the longest match found so far. The
search ends when there is no more branch to take, and the longest or best matching prefix will be the last prefix
remembered. For instance, if we search the best matching prefix (BMP) for an address beginning with the bit
pattern 10110 we start at the root in figure 7. Since the first bit of the address is 1 we move to the right, to the
node marked with prefix d and we remember d as the BMP found so far. Then we move to the left since the
second address bit is 0, this time the node is not marked as prefix, so d is still the BMP found so far. Next the
third address bit is 1 but at this point there is no branch labeled 1, so the search ends and the last remembered BMP (prefix d) is the correct one.
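The lookup procedure described above translates almost line by line into code. A minimal sketch (names `insert`/`lookup` are ours) using the prefixes of Figure 7:

```python
class Node:
    def __init__(self):
        self.children = {}   # '0' or '1' -> Node
        self.prefix = None   # label if this node stores a prefix

def insert(root, bits, label):
    """Walk the bits of the prefix, creating nodes as needed."""
    node = root
    for b in bits:
        node = node.children.setdefault(b, Node())
    node.prefix = label

def lookup(root, addr_bits):
    """Longest-prefix match: remember the last prefix node visited."""
    best, node = None, root
    for b in addr_bits:
        if node.prefix is not None:
            best = node.prefix       # dark node: BMP found so far
        if b not in node.children:
            return best              # no branch to take: search ends
        node = node.children[b]
    return node.prefix if node.prefix is not None else best

# Prefixes from Figure 7
root = Node()
for bits, label in [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"),
                    ("100", "e"), ("1100", "f"), ("1101", "g"),
                    ("1110", "h"), ("1111", "i")]:
    insert(root, bits, label)

assert lookup(root, "10110") == "d"   # the example traced in the text
assert lookup(root, "01100") == "c"   # c preferred over a (more specific)
```

As the survey notes, the cost is one node visit per address bit, up to 32 memory accesses for IPv4.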
Router internals — 19
Path-compressed trie
…by the bit-number field in the nodes traversed. When a node marked as prefix is encountered, a comparison with
the actual prefix value is performed. This is necessary since during the descent in the trie we may skip some
bits. If a match is found, we proceed traversing the trie and keep the prefix as the BMP so far. Search ends when
a leaf is encountered or a mismatch is found. As usual the BMP will be the last matching prefix encountered.
For instance, if we look for the BMP of an address beginning with the bit pattern 010110 in the path compressed
trie shown in figure 9, we proceed as follows: We start at the root node and since its bit number is 1 we inspect
the first bit of the address. The first bit is 0 so we go to the left. Since the node is marked as prefix we compare
the prefix a with the corresponding part of the address (0). Since they match we proceed and keep a as the BMP
so far. Since the node’s bit number is 3 we skip the second bit of the address and inspect the third one. This bit
is 0 so we go to the left. Again we check whether the prefix b matches the corresponding part of the address
(01011). Since they do not match, search stops and the last remembered BMP (prefix a) is the correct BMP.
Path-compression was first proposed in a scheme called PATRICIA [10], but this scheme does not support
longest prefix matching. Sklower proposed a scheme with modifications for longest prefix matching in [13]. In
fact, this variant was originally designed not only to support prefixes but more general non-contiguous masks.
Since this feature was never really used, current implementations differ somewhat from Sklower's original
scheme. For example, the BSD version of the path-compressed trie (referred to as BSD trie) is essentially the
same as we have just described. The basic difference is that in the BSD scheme, the trie is first traversed without
checking the prefixes at internal nodes. Once at a leaf, the traversed path is backtracked in search of the longest
matching prefix. At each node with a prefix, or a list of prefixes, a comparison is performed to check for a
match. Search ends when a match is found. Comparison operations are not made on the downward path in the
hope that not many exception prefixes exist. Note that with this scheme, in the worst case, the path is completely
traversed two times. In the case of the original Sklower’s scheme the backtrack phase also needs to do recursive
descents of the trie because non-contiguous masks are allowed.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 9: A path-compressed trie for the same prefixes. One-child chains are removed; each node carries the bit number (1 to 5) to inspect next.]
Until recently, the longest matching prefix problem has been addressed by using data structures based on
path-compressed tries, like the BSD trie. Path-compression makes much sense when the binary trie is sparsely
populated. But when the number of prefixes increases and the trie gets denser, using path compression has little
benefit. Moreover, the principal disadvantage of path-compressed tries, as well as binary tries in general, is that
a search needs to do many memory accesses, in the worst case 32 for IPv4 addresses. For example, for a typical
backbone router [18] with 47113 prefixes, the BSD version for a path-compressed trie creates 93304 nodes. The
maximal height is 26, while the average height is almost 20. For the same prefixes, a simple binary trie (with
one-child nodes) has a maximal height of 30 and an average height of almost 22. As we can see, the heights of
both tries are very similar and the BSD trie may perform additional comparison operations when backtracking
is needed.
• Useful for a sparsely populated space, but many prefixes are in use in IPv4
• Backtracking necessary: after reaching e and finding out that it does not match, need to go back to d (for 101…, for example)
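The remember-as-you-go variant described in the excerpt (not the BSD backtracking version) can be sketched as follows: build a binary trie, splice out one-child chains that carry no prefix, record at each node which bit it tests, and re-compare the stored prefix at every marked node because skipped bits were never inspected. Names (`build`, `compress`, `lookup`) are ours:

```python
class Node:
    def __init__(self, bit):
        self.bit = bit        # index of the address bit tested at this node
        self.kids = {}        # '0'/'1' -> Node
        self.prefix = None    # (bits, label) if a prefix is stored here

def build(prefixes):
    """Binary trie first, then path compression."""
    root = Node(0)
    for bits, label in prefixes:
        node = root
        for depth, b in enumerate(bits, start=1):
            node = node.kids.setdefault(b, Node(depth))
        node.prefix = (bits, label)
    compress(root)
    return root

def compress(node):
    for b in list(node.kids):
        child = node.kids[b]
        while child.prefix is None and len(child.kids) == 1:
            (child,) = child.kids.values()   # splice chain; child.bit keeps the depth
        node.kids[b] = child
        compress(child)

def lookup(root, addr):
    best, node = None, root
    while node is not None:
        if node.prefix is not None:
            bits, label = node.prefix
            if not addr.startswith(bits):    # skipped bits may mismatch: compare
                break                        # mismatch: search stops
            best = label                     # match: BMP found so far
        if node.bit >= len(addr):
            break
        node = node.kids.get(addr[node.bit])
    return best

prefixes = [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"), ("100", "e"),
            ("1100", "f"), ("1101", "g"), ("1110", "h"), ("1111", "i")]
root = build(prefixes)
assert lookup(root, "010110") == "a"   # the example worked in the excerpt
assert lookup(root, "10110") == "d"    # comparison fails at e, d remains the BMP
```

Because the best match so far is remembered on the way down, no explicit backtracking step is needed in this variant.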
Router internals — 20
Disjoint prefix trie
We have seen that prefixes can overlap (see figure 4). In a trie, when two prefixes overlap, one of them is
itself a prefix of the other, see figures 7 and 8. Since prefixes represent intervals of contiguous addresses, when
two prefixes overlap this means that one interval of addresses contains another interval of addresses, see figure
4 and 8. In fact, that is why an address can be matched to several prefixes. If several prefixes match, the longest
prefix match rule is used in order to find the most specific forwarding information. One way to avoid the use of
the longest prefix match rule and to still find the most specific forwarding information is to transform a given
set of prefixes into a set of disjoint prefixes. Disjoint prefixes do not overlap and thus no address prefix is itself
prefix of another one. A trie representing a set of disjoint prefixes will have prefixes at the leaves but not at
internal nodes. To obtain a disjoint-prefix binary trie, we simply add leaves to nodes that have only one child.
These new leaves are new prefixes that inherit the forwarding information of the closest ancestor marked as a
prefix. Finally, internal nodes marked as prefixes are unmarked. For example, figure 10 shows the disjoint-prefix
binary trie that corresponds to the trie in figure 7. Prefixes a′, a″ and a‴ have inherited the forwarding information
of the original prefix a, which has now been suppressed. Prefix d′ has been obtained in a similar way. Since
prefixes at internal nodes are expanded or pushed down to the leaves of the trie, this technique has been called
leaf pushing by Srinivasan et al. [14]. Figure 11 shows the disjoint intervals of addresses that correspond to the
disjoint-prefix binary trie of figure 10.
Prefixes: a 0*, b 01000*, c 011*, d 1*, e 100*, f 1100*, g 1101*, h 1110*, i 1111*
[Figure 10: Disjoint-prefix binary trie. The internal prefixes a and d are unmarked and pushed down to new leaves a′, a″, a‴ and d′.]
[Figure 11: Expanded disjoint-prefix binary trie, showing the disjoint intervals of addresses covered by a′, a″, a‴, b, c, d′, e, f, g, h and i.]
Compression techniques: Data compression tries to remove redundancy from the encoding. The idea of using compression comes from the fact that expanding the prefixes increases information redundancy.
• Disjoint prefixes do not overlap
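Leaf pushing as described in the excerpt can be sketched as a transformation of a binary trie: unmark internal prefixes, create the missing sibling at one-child nodes, and let every leaf inherit the closest marked ancestor's information. Afterwards the first (and only) label met on the search path is the answer. Names and construction are ours:

```python
class Node:
    def __init__(self):
        self.kids = {}       # '0'/'1' -> Node
        self.label = None    # forwarding info if this node is a prefix

def insert(root, bits, label):
    node = root
    for b in bits:
        node = node.kids.setdefault(b, Node())
    node.label = label

def leaf_push(node, inherited=None):
    """Make the prefix set disjoint: only leaves keep forwarding info."""
    if node.label is not None:
        inherited = node.label
        if node.kids:
            node.label = None            # unmark internal prefix nodes
    if not node.kids:
        node.label = inherited           # leaf inherits closest marked ancestor
        return
    if len(node.kids) == 1 and inherited is not None:
        missing = "1" if "0" in node.kids else "0"
        node.kids[missing] = Node()      # new leaf covers the uncovered branch
    for child in node.kids.values():
        leaf_push(child, inherited)

def lookup(root, addr):
    """Disjoint prefixes: the first label met is the answer, no
    longest-match bookkeeping or backtracking needed."""
    node = root
    for b in addr:
        if b not in node.kids:
            return None
        node = node.kids[b]
        if node.label is not None:
            return node.label
    return None

root = Node()
for bits, label in [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d"),
                    ("100", "e"), ("1100", "f"), ("1101", "g"),
                    ("1110", "h"), ("1111", "i")]:
    insert(root, bits, label)
leaf_push(root)
assert lookup(root, "01100") == "c"
assert lookup(root, "10110") == "d"   # served by a pushed-down copy of d
```

The price of the simpler lookup is the extra leaves: as in figure 10, prefix a is replicated three times here.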
Router internals — 21
There are other techniques!
Router internals — 22
Sources
• S. Keshav; “An engineering approach to computer networking”
• Cisco Router Architecture, www.cisco.com/networkers/nw99_pres/601.pdf
• Ross & Kurose “Computer Networking”
• …