
Page 1: Internetworking: Hardware/Software Interface

Internetworking: Hardware/Software Interface

CS 213, LECTURE 16L.N. Bhuyan

Page 2: Internetworking: Hardware/Software Interface

04/19/23 CS258 S99 2

Protocols: HW/SW Interface

• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently

– Enabling technologies: SW standards that allow reliable communication without requiring reliable networks

– A hierarchy of SW layers, giving each layer responsibility for a portion of the overall communication task; such hierarchies are called protocol families or protocol suites

• Transmission Control Protocol/Internet Protocol (TCP/IP)

– This protocol family is the basis of the Internet

– IP makes best effort to deliver; TCP guarantees delivery

– TCP/IP is used even when communicating locally: NFS uses IP even when communicating across a homogeneous LAN

Page 3: Internetworking: Hardware/Software Interface


TCP/IP packet

• Application sends a message

• TCP breaks the message into segments of up to 64 KB and adds a 20 B header to each

• IP adds a 20 B header and sends to the network

• If Ethernet, broken into 1500 B packets with headers and trailers

• Headers and trailers have length field, destination, window number, version, ...

[Figure: encapsulation – TCP data (≤ 64 KB) plus TCP header becomes IP data behind an IP header, carried in an Ethernet frame]
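The header numbers above imply a small but measurable wire overhead. A rough sketch of the arithmetic (the 18 B Ethernet header/trailer figure and the simplified fragmentation model are assumptions, not from the slides):

```python
import math

TCP_HDR = 20       # bytes, TCP header per segment (from the slide)
IP_HDR = 20        # bytes, IP header per datagram/fragment (from the slide)
ETH_MTU = 1500     # bytes of IP data carried per Ethernet frame
ETH_OVH = 18       # Ethernet header + CRC trailer per frame (assumed)

def wire_bytes(message: int) -> int:
    """Bytes on the wire for one TCP segment carrying `message` bytes."""
    tcp_segment = message + TCP_HDR
    frag_payload = ETH_MTU - IP_HDR      # 1480 B of segment data per fragment
    frames = math.ceil(tcp_segment / frag_payload)
    return tcp_segment + frames * (IP_HDR + ETH_OVH)

msg = 64 * 1024                          # one maximal 64 KB segment
total = wire_bytes(msg)
print(f"{total} bytes on wire for {msg} B of data "
      f"({100 * (total - msg) / msg:.1f}% overhead)")
```

For a full 64 KB segment this works out to 45 Ethernet frames and under 3% overhead, which is why the per-packet CPU cost, not the header bytes, is the real concern in later slides.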

Page 4: Internetworking: Hardware/Software Interface

Communicating with the Server: The O/S Wall

[Figure: CPU with user and kernel spaces, NICs attached over the PCI bus]

Problems:
• O/S overhead to move a packet between the network and application level => protocol stack (TCP/IP)
• O/S interrupts
• Data copying from kernel space to user space and vice versa
• The PCI bus bottleneck

Page 5: Internetworking: Hardware/Software Interface


The Send/Receive Operation

1. The application writes the transmit data to the TCP/IP sockets interface, in payload sizes ranging from 4 KB to 64 KB.

2. The data is copied from user space to kernel space.

3. The OS segments the data into maximum transmission unit (MTU)–size packets, and then adds TCP/IP header information to each packet.

4. The OS copies the data onto the network interface card (NIC) send queue.

5. The NIC performs a direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts the CPU to indicate completion of the transfer.
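The user-to-kernel copy at the start of this path is exactly what an ordinary sockets send performs. A minimal loopback sketch (all names illustrative), where `sendall` triggers the user-to-kernel copy and `recv` the kernel-to-user copy on the other side:

```python
import socket

# A loopback TCP connection: sendall() copies the payload from this
# process's user-space buffer into a kernel socket buffer, after which
# the OS segments it and queues it for the (virtual) NIC.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # OS picks a free port
server.listen(1)

client = socket.create_connection(server.getsockname())
peer, _ = server.accept()

payload = b"x" * 8192                # 8 KB user-space buffer
client.sendall(payload)              # user -> kernel copy happens here
client.close()

received = b""
while len(received) < len(payload):
    received += peer.recv(65536)     # kernel -> user copy on receive
peer.close()
server.close()
print(len(received), "bytes round-tripped through the kernel")
```

Every byte here crosses the user/kernel boundary twice, which is the copying overhead the following slides try to eliminate.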

Page 6: Internetworking: Hardware/Software Interface


Transmitting data across the memory bus using a standard NIC

http://www.dell.com/downloads/global/power/1q04-her.pdf

Page 7: Internetworking: Hardware/Software Interface


Timing Measurement in UDP Communication

X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005

Page 8: Internetworking: Hardware/Software Interface


I/O Acceleration Techniques

• TCP Offload: offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TOEs)

– A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.

• O/S Bypass: user-level software techniques to bypass the protocol stack – zero-copy protocol

(Needs a programmable device in the NIC for direct user-level memory access – virtual-to-physical memory mapping. Ex. VIA)

• Architectural techniques: instruction-set optimization, multithreading, copy engines, onloading, prefetching, etc.
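The copy-avoidance idea can be seen even without special NIC hardware. A kernel-assisted sketch using `os.sendfile`, which moves file pages to a socket without the usual kernel-to-user-to-kernel bounce (this is an OS facility, not the user-level VIA/RDMA bypass described above; behavior assumes a Linux-like sendfile):

```python
import os
import socket
import tempfile

# Kernel-assisted zero copy: sendfile() pushes the file's pages from
# the page cache straight to the socket, skipping the extra copy that
# read()+send() would incur through a user-space buffer.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload " * 1024)          # 8 KB file
    path = f.name

a, b = socket.socketpair()               # in-process stand-in for a peer
with open(path, "rb") as src:
    sent = os.sendfile(a.fileno(), src.fileno(), 0, os.path.getsize(path))

data = b""
while len(data) < sent:
    data += b.recv(65536)
a.close()
b.close()
os.remove(path)
print(sent, "bytes sent with zero user-space copies")
```

True O/S bypass (VIA, RDMA) goes further: the NIC itself translates virtual addresses and moves data directly between application buffers and the wire, with no kernel involvement on the data path at all.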

Page 9: Internetworking: Hardware/Software Interface


Comparing standard TCP/IP and TOE-enabled TCP/IP stacks

(http://www.dell.com/downloads/global/power/1q04-her.pdf)

Page 10: Internetworking: Hardware/Software Interface


Chelsio 10 Gbps TOE

Page 11: Internetworking: Hardware/Software Interface


Cluster (Network) of Workstations/PCs

Page 12: Internetworking: Hardware/Software Interface


Myrinet Interface Card

Page 13: Internetworking: Hardware/Software Interface


InfiniBand Interconnection

• Zero-copy mechanism. The zero-copy mechanism enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.

• RDMA. RDMA facilitates transferring data from remote memory to local memory without the involvement of host CPUs.

• Reliable transport services. The InfiniBand architecture implements reliable transport services so the host CPU is not involved in protocol-processing tasks like segmentation, reassembly, NACK/ACK, etc.

• Virtual lanes. InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent data lanes into the same physical lane, including a dedicated VL for management operations.

• High link speeds. InfiniBand architecture defines three link speeds, which are characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.

Reprinted from Dell Power Solutions, October 2004, by Onur Celebioglu, Ramesh Rajagopalan, and Rizwan Ali
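The 1X/4X/12X rates above follow directly from a 2.5 Gbps signaling lane. A quick sketch, also showing effective data rate under 8b/10b line encoding (the 8b/10b detail comes from the InfiniBand 1.x specification, not from this slide):

```python
# InfiniBand link rates: 1X/4X/12X lane widths at 2.5 Gbps signaling
# per lane (from the slide). 8b/10b encoding (assumed, per the IB 1.x
# spec) leaves 80% of the signaling rate for data.
LANE_GBPS = 2.5
ENCODING_EFFICIENCY = 8 / 10

for width in (1, 4, 12):
    raw = width * LANE_GBPS
    data = raw * ENCODING_EFFICIENCY
    print(f"{width:>2}X: {raw:4.1f} Gbps signaling, {data:4.1f} Gbps data")
```

So the quoted 2.5/10/30 Gbps figures are signaling rates; usable bandwidth is about 20% lower on these early links.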

Page 14: Internetworking: Hardware/Software Interface


InfiniBand system fabric

Page 15: Internetworking: Hardware/Software Interface


UDP Communication – Life of a Packet

X. Zhang, L. Bhuyan and W. Feng, “Anatomy of UDP and M-VIA for Cluster Communication” Journal of Parallel and Distributed Computing (JPDC), Special issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005, pp. 1290-1298.

Page 16: Internetworking: Hardware/Software Interface


Timing Measurement in UDP Communication

X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005

Page 17: Internetworking: Hardware/Software Interface


Network Bandwidth is Increasing

[Figure: log-scale plot of network bandwidth (Gbps) and CPU frequency (GHz) vs. time, 1990–2010: network bandwidth outpaces Moore's Law]

• TCP requirements, rule of thumb: 1 GHz of CPU for 1 Gbps of network bandwidth

• The gap between the rate at which network applications can be processed and the fast-growing network bandwidth keeps increasing
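The "1 GHz for 1 Gbps" rule of thumb translates into a per-packet cycle budget at line rate. A quick check (packet sizes here are illustrative):

```python
# Cycle budget per packet implied by the 1 GHz / 1 Gbps rule of thumb:
# at line rate, cycles/packet = packet_bits, since CPU_HZ == LINK_BPS.
CPU_HZ = 1e9        # 1 GHz dedicated to the stack
LINK_BPS = 1e9      # 1 Gbps link

for pkt_bytes in (64, 512, 1500):
    pkts_per_sec = LINK_BPS / (pkt_bytes * 8)
    cycles_per_pkt = CPU_HZ / pkts_per_sec
    print(f"{pkt_bytes:>5} B packets: {pkts_per_sec:>12,.0f} pkt/s, "
          f"{cycles_per_pkt:>7,.0f} cycles/packet")
```

Even full 1500 B frames allow only about 12,000 cycles per packet, which is tight against the ~21K cycles per packet measured on the next slide.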

Page 18: Internetworking: Hardware/Software Interface


Profile of a Packet

[Figure: per-packet processing profile (1 KB receive) – system overheads, descriptor & header accesses, IP processing, TCB accesses, TCP processing, and memory copy, split between compute time and memory time]

Total avg. clocks/packet: ~21K; effective bandwidth: 0.6 Gb/s
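The two headline numbers on this slide are consistent with each other. A quick cross-check (the 1.7 GHz clock is an assumption borrowed from the Banias figures on the "Problem Magnitude" slide later in the deck):

```python
# Cross-check: ~21K clocks to receive one 1 KB packet caps throughput.
clocks_per_packet = 21_000
packet_bits = 1024 * 8        # 1 KB receive
cpu_hz = 1.7e9                # assumed clock (Banias-class CPU)

packets_per_sec = cpu_hz / clocks_per_packet
gbps = packets_per_sec * packet_bits / 1e9
print(f"{gbps:.2f} Gb/s effective bandwidth")
```

This lands at roughly 0.66 Gb/s, matching the quoted 0.6 Gb/s effective bandwidth.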

Page 19: Internetworking: Hardware/Software Interface


Five Emerging Technologies

• Optimized Network Protocol Stack (ISSS+CODES, 2003)

• Cache Optimization (ISSS+CODES, 2003, ANCHOR, 2004)

• Network Stack Affinity Scheduling

• Direct Cache Access

• Lightweight Threading

• Memory Copy Engine (ICCD 2005 and IEEE TC)

Page 20: Internetworking: Hardware/Software Interface


Stack Optimizations (Instruction Count)

• Separate data & control paths
– TCP data-path focused
– Reduce # of conditionals
– NIC assist logic (L3/L4 stateless logic)

• Basic memory optimizations
– Cache-line-aware data structures
– SW prefetches

• Optimized computation
– Standard compiler capability

Result: 3X reduction in instructions per packet

Page 21: Internetworking: Hardware/Software Interface


Network Stack Affinity

[Figure: multiple multi-core CPUs with their own memory and I/O interfaces behind a chipset; one CPU is dedicated to network I/O (Intel calls this onloading)]

• Assigns network I/O workloads to designated devices
• Separates network I/O from application work
• Reduces scheduling overheads
• More efficient cache utilization
• Increases pipeline efficiency

Page 22: Internetworking: Hardware/Software Interface


Direct Cache Access (DCA)

Normal DMA write (NIC → memory controller → memory):
1. DMA write
2. Snoop invalidate
3. Memory write
4. CPU read

Direct Cache Access (NIC → memory controller → CPU cache):
1. DMA write
2. Cache update
3. CPU read

Eliminates 3 to 25 memory accesses per packet by placing packet data directly into the cache

Page 23: Internetworking: Hardware/Software Interface


Lightweight Threading

[Figure: single-core pipeline with a thread manager multiplexing two S/W-controlled threads onto one hardware context]

• On a memory informing event (e.g., a cache miss), the thread manager switches S/W-controlled threads
• Computation continues in the single pipeline in the shadow of the cache miss
• Builds on helper threads; reduces CPU stalls

Page 24: Internetworking: Hardware/Software Interface


Potential Efficiencies (10X)

On-CPU, multi-gigabit, line-speed network I/O is possible.

Benefits of affinity and architectural techniques: Greg Regnier, et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004

Page 25: Internetworking: Hardware/Software Interface


I/O Acceleration – Problem Magnitude

[Figure: I/O processing overheads – protocol-processing cycles (0–50,000, on a 1.7 GHz Banias) and achievable data rate (0–4 Gbps) for TCP Orig, TCP Opt, iSCSI, SSL, and XML. Dominant costs: memory copies & effects of streaming (networking), CRCs (storage over IP), crypto (security), parsing and tree construction (services)]

I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations.

I/O Processing Rates are significantly limited by CPU in the face of Data I/O Processing Rates are significantly limited by CPU in the face of Data Movement and Transformation OperationsMovement and Transformation Operations