1
Introduction
Linus Svensson D5, [email protected]
Åke Östmark D5, [email protected]
2
Why We Are Here
- The architecture of a Network Processor Unit (NPU)
- Master’s thesis - a joint operation between Luleå University of Technology and SwitchCore AB
3
Today's Topics
- Background
  - Ethernet and internetworks
  - Switches and routers
- NPU (Network Processor Unit)
  - Why an NPU?
  - Pros and cons of NPUs
- The architecture of our NPU
  - Design difficulties and design choices
  - The architecture, strengths and weaknesses
- The big picture
  - From idea to silicon
4
Ethernet
- Most widespread network technology used in LAN (Local Area Network)
  - 10 Mb/s (Ethernet)
  - 100 Mb/s (Fast Ethernet)
  - 1000 Mb/s (Gigabit Ethernet)
- Packet switched network
  - Host-to-host delivery on the same network
  - Switches forward packets from one section to another using the datagram paradigm
5
Ethernet
- Datagram paradigm
  - Packet contains enough information for a switch to forward it correctly
  - I.e. packet contains complete destination address
- Ethernet packets = frames
  - In Ethernet the packets are referred to as frames
6
Ethernet Frame Format
- Preamble
  - 64 bits used for synchronisation
- Header
  - 48-bit globally unique destination address
  - 48-bit globally unique source address
  - 16-bit type field used for classification

Field:  Preamble | Dest addr | Source addr | Type | Body    | CRC
Bytes:  8        | 6         | 6           | 2    | 46-1500 | 4
7
Ethernet Frame Format
- Body
  - 46-1500 bytes of data
- CRC
  - 32-bit CRC (Cyclic Redundancy Check) for error detection

Field:  Preamble | Dest addr | Source addr | Type | Body    | CRC
Bytes:  8        | 6         | 6           | 2    | 46-1500 | 4
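The frame layout above can be illustrated with a minimal Python sketch that splits a frame into its header fields. It assumes the preamble has already been stripped (as is usual once a frame reaches software); the function name, the example addresses and the zero-filled body and CRC are hypothetical.

```python
import struct

HEADER_BYTES = 6 + 6 + 2  # dest addr + source addr + type


def parse_ethernet_header(frame: bytes) -> dict:
    """Split a frame (preamble already removed) into header fields and the rest."""
    if len(frame) < HEADER_BYTES:
        raise ValueError("frame shorter than the 14-byte header")
    dest, src, eth_type = struct.unpack("!6s6sH", frame[:HEADER_BYTES])
    return {
        "dest": dest.hex(":"),
        "src": src.hex(":"),
        "type": eth_type,
        "body_and_crc": frame[HEADER_BYTES:],
    }


# Hypothetical frame: broadcast destination, made-up source address,
# type 0x0800 (IPv4), minimum 46-byte body and a 4-byte CRC, both zero-filled.
example = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(46) + bytes(4)
fields = parse_ethernet_header(example)
print(fields["dest"], fields["src"], hex(fields["type"]))
```

The type field is what lets a device classify the frame, for example to recognise that the body carries an IP packet.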
8
Internetworks
- Internetwork
  - Several physical networks combined into one logical internetwork
  - Also called internet (with lowercase “i”)
  - Most famous is the world spanning Internet (with capital “I”)
- Host-to-host delivery between different networks
9
Internet Protocol (IP)
- Most widespread protocol used in internetworks
- Routers forward packets from one network to another using the datagram paradigm
10
IP Packet Format
- 12 bytes of status fields e.g. version, length etc
- 32-bit globally unique source address
- 32-bit globally unique destination address
- Optional fields of variable length
- Body

Field:  Ver, len etc | Source addr | Dest addr | Opt | Body
Bytes:  12           | 4           | 4         | 0-65515 (Opt and Body combined)
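As a rough illustration of the layout above, the following Python sketch pulls the source and destination addresses out of an IPv4 header and uses the header-length field to separate options from body. The function name and the example packet are hypothetical, and the sketch does not verify the header checksum.

```python
import ipaddress
import struct


def parse_ip_header(packet: bytes) -> dict:
    """Extract addresses, options and body from an IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("packet shorter than the 20-byte fixed header")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4          # header length in bytes
    total_len = struct.unpack("!H", packet[2:4])[0]
    return {
        "version": version,
        "src": ipaddress.IPv4Address(packet[12:16]),   # after 12 bytes of status fields
        "dst": ipaddress.IPv4Address(packet[16:20]),
        "options": packet[20:header_len],
        "body": packet[header_len:total_len],
    }


# Hypothetical packet: no options, 4-byte body, checksum left at zero.
status = struct.pack("!BBHHHBBH", 0x45, 0, 24, 0, 0, 64, 17, 0)   # 12 bytes
example_packet = status + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2]) + b"data"
ip_fields = parse_ip_header(example_packet)
print(ip_fields["src"], "->", ip_fields["dst"])
```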
11
IP Over Ethernet
- IP packets are encapsulated in Ethernet frames

Ethernet frame:                   Preamble | Dest addr | Source addr | Type | Body | CRC
IP packet (inside Ethernet Body): Ver, len etc | Source addr | Dest addr | Opt | Body
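Encapsulation can be sketched in a few lines of Python: the IP packet becomes the body of an Ethernet frame, padded up to the 46-byte minimum. The helper name and addresses are hypothetical. Computing the CRC with `zlib.crc32` and appending it least-significant byte first is an assumption about the on-wire convention, and the preamble is left out because hardware normally adds it.

```python
import struct
import zlib

ETHERTYPE_IPV4 = 0x0800
MIN_BODY_BYTES = 46
MAX_BODY_BYTES = 1500


def encapsulate(ip_packet: bytes, dest_mac: bytes, src_mac: bytes) -> bytes:
    """Wrap an IP packet in an Ethernet frame (without preamble)."""
    if len(ip_packet) > MAX_BODY_BYTES:
        raise ValueError("IP packet does not fit in one Ethernet body")
    body = ip_packet.ljust(MIN_BODY_BYTES, b"\x00")       # pad short packets
    header = dest_mac + src_mac + struct.pack("!H", ETHERTYPE_IPV4)
    crc = struct.pack("<I", zlib.crc32(header + body))     # assumed byte order
    return header + body + crc


# Hypothetical 20-byte IP header with no body and a zero checksum.
ip_packet = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 17, 0,
    bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
)
frame = encapsulate(ip_packet, bytes.fromhex("020000000002"), bytes.fromhex("020000000001"))
print(len(frame))
```

A 20-byte IP packet therefore still produces a minimum-size frame, which is the case the cycle budgets later in the deck are sized for.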
12
Host-To-Host Communication
[Figure: hosts (H), switches (S) and routers (R) connected across Network 1, Network 2 and Network 3]
13
Devices
- SwitchCore CXE-2010
  - A 16-port Gigabit Ethernet Switch-on-a-chip
  - Full 4K VLAN support
  - Includes support of IEEE 802.1p
- Cisco 1710
  - Security Access Router
  - Secure Internet, intranet, and extranet access with VPN and firewall
  - Advanced QoS features
14
Features
- What if we want:
  - Load Balancing
    - Distributing client requests across multiple servers
  - Multi-Protocol Label Switching (MPLS)
    - Next hop based on the label
15
Features
- What if we don’t want
  - QoS
  - Security features
- The Network Processor Unit (NPU)
  - A programmable CPU chip that is optimized for networking and communications functions
  - Quick adaptation of new standards/features
16
Conditions For the Work
- 1 GE (1000 Mbit) port
- 8 FE (100 Mbit) ports
- Scalable
  - Add more ports
  - Remove ports
- Feasible to make an ASIC prototype
17
- NPU components:
  - Processor Core
  - Embedded software
  - Network Interface
  - Packet buffers
  - Queues
  - Tables
  - Switch fabric
18
Design Choices
- Processor core
  - RISC based
  - Network specific
- Network Interface
  - FE
    - MII (Media Independent Interface)
    - RMII (Reduced MII)
  - GE
    - GMII (Gigabit MII)
    - RGMII (Reduced GMII)
19
Design Choices
- Queues
  - A packet ready for transmission
- Tables
  - Data structure for IP & MAC addresses
- Switch fabric
  - The internal interconnect architecture.
  - How to transport from in-port to out-port?
20
Design Choices
- Packet buffers
  - Internal and/or external
  - How many times do we need to access a (buffer) memory?
    - Write when receiving from the network
    - Read the packet for processing
    - Write the modified packet for transmission
    - Read the packet when transmitting
  - For N ports the memory needs to run at 4N times the port speed
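The 4N rule can be checked with a small Python sketch. It generalises the rule to ports of mixed speed by multiplying the four accesses per packet by the aggregate port speed, which is an assumption beyond what the slide states; the function name is hypothetical.

```python
def buffer_bandwidth_mbps(port_speeds_mbps, accesses_per_packet=4):
    """Memory bandwidth needed if every packet crosses the buffer 4 times."""
    return accesses_per_packet * sum(port_speeds_mbps)


# N = 8 Fast Ethernet ports: 4 * 8 * 100 Mb/s
fe_only = buffer_bandwidth_mbps([100] * 8)

# Mixed configuration: 1 GE port plus 8 FE ports sharing one buffer
ge_plus_fe = buffer_bandwidth_mbps([1000] + [100] * 8)

print(fe_only, ge_plus_fe)
```

The second figure shows why a single shared buffer is demanding for a mixed GE and FE design, and why splitting buffers across ports reduces the problem.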
21
Design Choices
- 8 FE ports
- 1 GE port
- Inter-arrival time:
  - 1.5*10^6 + 8*1.5*10^5 = 2.7*10^6 packets/s
  - -> New packet every 370 ns
- Cycle budget example:
  - 100 MHz -> 37 cycles to process every packet
  - 200 MHz -> 74 cycles to process every packet
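The figures above can be reproduced with a short Python sketch. It assumes worst-case minimum-size frames (64 bytes including header and CRC) plus the 8-byte preamble and a 12-byte inter-frame gap; the inter-frame gap is not mentioned on the slides and is an assumption here, as are all names.

```python
MIN_FRAME_BYTES = 64        # 6 + 6 + 2 + 46 + 4
PREAMBLE_BYTES = 8
INTERFRAME_GAP_BYTES = 12   # assumption: standard Ethernet gap, not on the slide
BITS_ON_WIRE = (MIN_FRAME_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8


def max_packet_rate(link_bps):
    """Worst-case packets per second for one port."""
    return link_bps / BITS_ON_WIRE


# 1 GE port plus 8 FE ports
total_pps = max_packet_rate(1_000_000_000) + 8 * max_packet_rate(100_000_000)
interarrival_ns = 1e9 / total_pps


def cycle_budget(clock_hz):
    """Whole clock cycles available between two packet arrivals."""
    return int(clock_hz * interarrival_ns / 1e9)


print(round(total_pps), round(interarrival_ns))
print(cycle_budget(100_000_000), cycle_budget(200_000_000))
```

Without rounding the packet rate to 2.7*10^6, the inter-arrival time comes out at about 373 ns rather than 370 ns; the whole-cycle budgets are still 37 and 74.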
22
Design Choices
- Model of operation
  - Route processing
  - Packet forwarding
    - ~200 cycles
  - Special services
- Target technology
  - ~150 MHz
23
Design Decisions
- 2 FE ports
  - 125 MHz
  - 1 Integer Unit
- 1 GE port
  - 125 MHz
  - 5 Integer Units
- -> Cycle budget of 420 for each packet
- Parallel Processor Architecture
- Interactive voice can tolerate somewhere between 100 and 200 milliseconds of end-to-end delay without people noticing it.
  - 420 cycles (at 125 MHz) -> 0.00336 ms
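One way to arrive at the 420-cycle budget is sketched below in Python. It assumes minimum-size frames occupying 672 bits on the wire (64-byte frame, 8-byte preamble, 12-byte inter-frame gap, the last being an assumption not stated on the slides) and that packets are spread evenly over the Integer Units of a processing unit. The function name is hypothetical, and this may not be the exact derivation used in the thesis.

```python
BITS_ON_WIRE = 672          # assumption: min frame + preamble + inter-frame gap
CLOCK_HZ = 125_000_000


def cycles_per_packet_per_iu(port_bps, ports, integer_units, clock_hz=CLOCK_HZ):
    """Cycles each Integer Unit may spend on one packet at full line rate."""
    aggregate_bps = port_bps * ports
    return BITS_ON_WIRE * clock_hz * integer_units // aggregate_bps


fe_budget = cycles_per_packet_per_iu(100_000_000, ports=2, integer_units=1)
ge_budget = cycles_per_packet_per_iu(1_000_000_000, ports=1, integer_units=5)

# Processing delay of one full budget, in milliseconds
budget_ms = fe_budget / CLOCK_HZ * 1000

print(fe_budget, ge_budget, budget_ms)
```

Under these assumptions both configurations give the same per-packet budget, and the resulting delay is several orders of magnitude below the 100 to 200 ms that interactive voice tolerates.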
24
Design Decisions
- Tables
  - MAC Address lookup, fixed length:
    - CAM (Content Addressable Memory)
      - Pros: Fast
      - Cons: Expensive
      - Like a cache
  - IP Address lookup, longest match:
    - Possibly large table
    - External SRAM
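The difference between the two lookups can be shown with a small software model in Python. A dictionary stands in for the exact-match behaviour of a CAM, and a linear scan stands in for longest-prefix matching; real hardware would use a CAM and a table in SRAM rather than these structures. All table contents and port numbers are made up.

```python
import ipaddress

# Exact match on a fixed-length key: functionally what a CAM provides.
mac_table = {"02:00:00:00:00:01": 3, "02:00:00:00:00:02": 5}

# Routing table: (prefix, output port). Entries are hypothetical.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), 0),      # default route
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 2),
]


def longest_prefix_match(address, table):
    """Return the port of the most specific prefix containing the address."""
    ip = ipaddress.ip_address(address)
    best = None
    for network, port in table:
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, port)
    return None if best is None else best[1]


print(mac_table.get("02:00:00:00:00:01"))
print(longest_prefix_match("10.1.2.3", routes))
```

The MAC lookup needs a single comparison against a complete key, whereas the IP lookup has to weigh several candidate prefixes, which is why the table can grow large.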
25
- Internal packet buffers:
  - Pros: Fast, less pin count
  - Cons: Limited size of memory
- 2 FE ports / 1 buffer
  - Pros: Reduce contention, reduce 4N problem
  - Cons: Less effective use of memory

[Figure: four input MACs, a shared memory, and packet buffers]
26
- Virtual output queues:
  - Pros: No Head Of Line (HOL) blocking, possible to select any packet from buffer memory
  - Cons: Expensive in hardware

[Figure: four input MACs with packet buffers holding packets 1-4, and four output MACs served through Virtual Output Queues]
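Head-of-line blocking can be demonstrated with a toy Python model. It compares a single FIFO, where only the head packet may leave, with one queue per output port. Packets are represented only by their destination port number, and all names are hypothetical.

```python
from collections import deque


def fifo_select(queue, free_outputs):
    """Single FIFO: only the packet at the head may be sent."""
    if queue and queue[0] in free_outputs:
        return queue.popleft()
    return None


def voq_select(voqs, free_outputs):
    """Virtual output queues: serve any free output that has a packet waiting."""
    for output in sorted(free_outputs):
        if voqs.get(output):
            return voqs[output].popleft()
    return None


# Three packets arrive, destined for outputs 1, 2 and 2; output 1 is busy.
arrivals = [1, 2, 2]
free_outputs = {2}

fifo = deque(arrivals)
voqs = {}
for destination in arrivals:
    voqs.setdefault(destination, deque()).append(destination)

fifo_result = fifo_select(fifo, free_outputs)   # head packet wants busy output 1
voq_result = voq_select(voqs, free_outputs)     # a packet for output 2 can leave

print(fifo_result, voq_result)
```

The FIFO sends nothing because its head packet is blocked, while the per-output queues keep output 2 busy. The hardware cost comes from maintaining a queue per output for every input.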
27
NPU Architecture
[Figure: block diagram of Receiving Units (RU), Processing Units (PU), Switching Fabric (SF) and Transmitting Units (TU), with CAM and SRAM as Shared Resources; 1.8 Gbps marked at two points]
28
PU with 1xIU
[Figure: block diagram of a processing unit with one Integer Unit]
- Blocks: MIPS IU, Arb, MemCtrl (Instr), MemCtrl (Data), Transmitter, Frame Engine, 8kB SRAM, 1kB SRAM (x2), CAM I/O, Shared SRAM I/O
- Bus widths: 128 (from RU), 128, 128, 32 (to SF), 32 (x4), 24
- 420 cycles / min size packet
- 1 transmit / 20 cycles (FE) or 1 transmit / 4 cycles (GE)
- 3 accesses / 40 cycles (not counting accesses from IU)
29
PU with 5xIU
[Figure: block diagram of a processing unit with five Integer Units]
- Blocks: MIPS IU, Arb (x4), MemCtrl (Instr), MemCtrl (Data), Transmitter, Frame Engine, 32kB SRAM, 1kB SRAM (x2), CAM I/O, Shared SRAM I/O
- Bus widths: 512 (from RU), 512, 512, 32 (to SF), 32 (x4), 24
- 420 cycles / min size packet
- 1 access / 32 cycles (not counting accesses from IUs)
- 1 transmit / 5 cycles
30
Performance
[Plot: Cycles (vertical axis, 0 to 250) against Frames (horizontal axis, 50 to 200); series: IP in shared SRAM, IP in internal SRAM, MAC in shared CAM]
31
Strengths in the Architecture
- More bandwidth
  - More RU and TU
  - New types of RU and TU
- More processing power
  - More PU per RU/TU
  - More IU per PU
  - New types of PU
  - New types of IU
32
Strengths in the Architecture
- New functionality
  - New types of shared resources
    - Semaphores
    - Multipurpose CPU
- New software
  - All IUs can run different software
33
Weaknesses in the Architecture
- Not everything scales well
  - Shared resources
  - No. of IUs in a PU
34
ASIC design flow
From Idea to Silicon
- Design Specification
- Design Entry: VHDL/Verilog
- Logic Synthesis: transfer to target technology (TSMC 0.18)
- Floor-planning: arrange blocks on chip
- Placement: decide location of cells in a block
- Routing: make connections between cells and blocks
- Circuit extraction
- Postlayout simulation
- Finished
35
ALU : process(alu_RegA, alu_RegB, In_Ctrl_Ex)
begin
  -- Select the ALU operation from the execute-stage control signals
  case In_Ctrl_Ex.OP is
    when ALU_ADD => alu_Result <= alu_RegA + alu_RegB;
    when ALU_SUB => alu_Result <= alu_RegA - alu_RegB;
    when ALU_AND => alu_Result <= alu_RegA and alu_RegB;
    when ALU_OR  => alu_Result <= alu_RegA or alu_RegB;
    when ALU_XOR => alu_Result <= alu_RegA xor alu_RegB;
    when ALU_NOR => alu_Result <= alu_RegA nor alu_RegB;
    -- Any other opcode: result is don't-care
    when others  => alu_Result <= (others => '-');
  end case;
end process;
Layout
2.6 x 2.6 mm