Realizing the Performance Potential of the Virtual Interface Architecture
Evan Speight, Hazim Abdel-Shafi, and John K. Bennett
Rice University, Dept. of Electrical and Computer Engineering
Presented by Constantin Serban, R.U.
VIA Goals
• Communication infrastructure for System Area Networks (SANs)
• Targets mainly high-speed cluster applications
• Efficiently harnesses the communication performance of underlying networks
Trends
• Peak bandwidth increased by two orders of magnitude over the past decade, while user-visible latency decreased only modestly
• The latency introduced by the protocol stack is typically several times the latency of the transport layer
• The problem is especially acute for small messages
Targets
VI architecture addresses the following issues:
• Decrease latency, especially for small messages (used in synchronization)
• Increase the aggregate bandwidth (only a fraction of the peak bandwidth is utilized)
• Reduce the CPU processing spent on message overhead
Overhead
Overhead mainly comes from two sources:
• Every network access requires one or two traps into the kernel – the user/kernel mode switch is time consuming
• Usually two data copies occur:
– From the user buffer to the message-passing API
– From the message layer to the kernel buffer
VIA approach
• Remove the kernel from the critical path
– Communication code moves out of the kernel into user space
• Provide a zero-copy protocol
– Data is sent/received directly from/into the user buffer; no message copy is performed
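The difference between the two paths can be sketched in a toy model (not real VIPL code, and the copy counts are the conceptual ones from the slides): the traditional path copies the data twice before the NIC sees it, while the VIA path only builds a descriptor pointing at the user buffer.

```python
# Toy model contrasting the traditional two-copy send path with VIA's
# zero-copy path, counting host-side buffer copies along the way.

def traditional_send(user_buffer):
    """Kernel-mediated send: data is copied twice before it reaches the NIC."""
    copies = 0
    api_buffer = bytes(user_buffer)      # copy 1: user buffer -> messaging API
    copies += 1
    kernel_buffer = bytes(api_buffer)    # copy 2: API layer -> kernel buffer
    copies += 1
    return kernel_buffer, copies

def via_send(user_buffer):
    """VIA-style send: a descriptor points at the registered user buffer;
    the NIC DMAs directly from it, so the host CPU makes no copy."""
    descriptor = {"address": id(user_buffer), "length": len(user_buffer)}
    return descriptor, 0                 # zero host-side copies

data = b"hello"
_, old_copies = traditional_send(data)   # -> 2 copies
_, new_copies = via_send(data)           # -> 0 copies
```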
VIA emerged as a standardization effort from Compaq, Intel, and Microsoft
It was built on several academic ideas: • The main architecture most similar to U-Net• Essential features derived from VMMCAmong current implementations :
– GigaNet cLan – VIA implemented in hardware– Tandem ServerNet –VIA software driver
emulated– Myricom Myrinet - software emulated in
firmware
VIA architecture
VIA operations
• Set-Up/Tear-Down
• Register/De-Register Memory
• Connect/Disconnect
• Transmit
• Receive
• RDMA
VIA operations
Set-Up/Tear-Down:
• VIA is a point-to-point, connection-oriented protocol
• The VI endpoint is the core concept in VIA
• The VipCreateVi function creates a VI endpoint in user space
• The user-level library passes the call to the kernel agent, which passes the creation information to the NIC
• The OS thus controls the application's access to the NIC
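The control path above can be modeled with a small sketch. This is an illustrative simulation, not the VIPL API: the class and method names (`KernelAgent`, `create_vi`, `program_endpoint`) are invented for clarity; only the flow (user library → kernel agent → NIC) comes from the slides.

```python
# Hypothetical model of VI-endpoint creation: the kernel agent is the
# only component allowed to touch the NIC's control interface, so the
# OS keeps control even though the later data path bypasses the kernel.

class Nic:
    def __init__(self):
        self.endpoints = {}

    def program_endpoint(self, vi_id, owner):
        # The NIC records which process owns which VI endpoint.
        self.endpoints[vi_id] = owner

class KernelAgent:
    def __init__(self, nic):
        self.nic = nic
        self.next_id = 0

    def create_vi(self, process):
        # Creation information is forwarded to the NIC on behalf of the app.
        vi_id = self.next_id
        self.next_id += 1
        self.nic.program_endpoint(vi_id, process)
        return vi_id   # handle returned to user space

nic = Nic()
agent = KernelAgent(nic)
vi = agent.create_vi("app-1")    # analogous role to VipCreateVi
```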
VIA operations - cont’d
Register/De-Register Memory:
• All data buffers and descriptors reside in registered memory
• The NIC performs DMA I/O operations on this registered memory
• Registration pins the pages into physical memory and provides a handle used to manipulate the pages and transfer their addresses to the NIC
• It is performed once, usually at the beginning of the communication session
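A minimal sketch of what registration buys the NIC, under the assumption (from the slides) that it pins pages once and hands back a handle. The pinning itself is only simulated by bookkeeping here; nothing below is real VIPL.

```python
# Illustrative model of memory registration: the returned handle lets
# the "NIC" resolve offsets into the registered region later, without
# involving the kernel on the data path.

PAGE = 4096

class Nic:
    def __init__(self):
        self.regions = {}          # handle -> (start address, length)
        self.next_handle = 1

    def register_memory(self, start, length):
        # In real VIA this pins the pages and loads their address
        # translations into the NIC; done once, at session start.
        handle = self.next_handle
        self.next_handle += 1
        self.regions[handle] = (start, length)
        return handle

    def resolve(self, handle, offset):
        # DMA must stay inside the registered region.
        start, length = self.regions[handle]
        assert 0 <= offset < length, "DMA outside registered region"
        return start + offset

nic = Nic()
h = nic.register_memory(start=0x10000, length=4 * PAGE)
addr = nic.resolve(h, 8192)      # -> 0x12000
```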
VIA operations - cont’d
Connect/Disconnect:
• Before communication, each endpoint is connected to a remote endpoint
• The connection request is passed to the kernel agent and down to the NIC
• VIA does not define an addressing scheme; existing schemes can be used in different implementations
VIA operations - cont’d
Transmit/Receive:
• The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer; both descriptor and data buffer reside in a registered memory area
• The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor
• Doorbells are maintained in an internal queue inside the NIC
VIA operations - cont’d
Transmit/Receive (cont’d):
• Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receive queue of its NIC
• When the doorbell reaches the head of the sender's queue, the data is sent into the network through a double indirection (doorbell → descriptor → buffer)
• The first doorbell/descriptor is picked up from the receiver's queue and its buffer is filled with the data
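The whole doorbell/descriptor flow above can be simulated end to end. This is a conceptual toy, not hardware-accurate: in a real NIC the descriptor and buffer live in registered memory and the NIC DMAs them, but the two-level indirection (doorbell → descriptor → buffer) is the same.

```python
# Toy end-to-end model of the transmit/receive flow: both sides post
# descriptors and doorbells; "transmit" plays the role of both NICs.

from collections import deque

class Endpoint:
    def __init__(self):
        self.doorbells = deque()   # doorbell queue maintained inside the NIC
        self.memory = {}           # registered memory: descriptor addr -> buffer

    def post(self, descriptor_addr, buffer):
        # The doorbell carries only the address of the descriptor.
        self.memory[descriptor_addr] = buffer
        self.doorbells.append(descriptor_addr)

def transmit(sender, receiver):
    # Sender NIC: doorbell at queue head -> descriptor address -> data buffer.
    send_desc = sender.doorbells.popleft()
    data = sender.memory[send_desc]
    # Receiver NIC: first posted doorbell supplies the empty buffer to fill.
    recv_desc = receiver.doorbells.popleft()
    receiver.memory[recv_desc][:len(data)] = data

tx, rx = Endpoint(), Endpoint()
tx.post(0x100, bytearray(b"ping"))   # sender: descriptor + data buffer
rx.post(0x200, bytearray(4))         # receiver: descriptor + empty buffer
transmit(tx, rx)                     # rx buffer now holds b"ping"
```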
VIA operations - cont’d
RDMA:
• As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and RDMA Write
• Each node allocates a receive buffer and registers it with the NIC. Additional structures containing read and write pointers into the receive buffers are exchanged during connection setup
• Each node can then read and write the remote node's address space directly
• These operations pose potential implementation problems
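The RDMA semantics can be sketched as follows, assuming (per the slides) that pointers to each side's registered buffer are exchanged at connection setup. The names (`connect`, `rdma_write`, `rdma_read`) are illustrative, not the VIPL RDMA API; the key point modeled is that the remote CPU does not participate in the transfer.

```python
# Sketch of RDMA read/write into a remote registered buffer.

class Node:
    def __init__(self, size):
        self.recv_buffer = bytearray(size)   # registered with the NIC
        self.remote = None                   # pointer to peer's buffer

def connect(a, b):
    # Buffer pointers are exchanged during connection setup.
    a.remote = b.recv_buffer
    b.remote = a.recv_buffer

def rdma_write(node, offset, data):
    # Direct write into remote memory; the remote CPU never runs.
    node.remote[offset:offset + len(data)] = data

def rdma_read(node, offset, length):
    return bytes(node.remote[offset:offset + length])

left, right = Node(16), Node(16)
connect(left, right)
rdma_write(left, 0, b"data")         # right's buffer is modified directly
echoed = rdma_read(left, 0, 4)       # -> b"data"
```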
Evaluation Benchmarks
• Two VI implementations:
– GigaNet cLan: bandwidth 125 MB/s, latency 480 ns
– Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns
• Performance measured:
– Bandwidth and latency
– Polling vs. blocking
– CPU utilization
Bandwidth
Latency
Latency Polling/Blocking
CPU utilization
MPI performance using VIA
• The challenge is to deliver this performance to distributed applications
• Software layers such as MPI typically sit between VIA and the application: they provide increased usability but add overhead
• How can this layer be optimized to work efficiently with VIA?
MPI VIA - performance
MPI observations
• The difference between MPI-UDP and MPI-VIA-baseline is remarkable
• MPI-VIA-baseline is still dramatically slower than native VIA
• Several improvements are proposed to bring MPI-VIA closer to native VIA by reducing MPI overhead
MPI Improvements
• Eliminating unnecessary copies: MPI over UDP and over VIA uses a single set of receive buffers, so data must be copied to the application. Improvement: allow the user to register any buffer
• Choosing a synchronization primitive: all synchronization formerly used OS constructs/events. A better implementation uses processor swap instructions
• No acknowledgements: remove per-message acknowledgements by switching to VIA's reliable delivery mode
VIA - Disadvantages
• Polling vs. blocking synchronization: a tradeoff between CPU consumption and overhead
• Memory registration: locking large amounts of memory makes virtual-memory mechanisms inefficient, and registering/deregistering on the fly is slow
• Point-to-point vs. multicast: VIA lacks multicast primitives; implementing multicast over the existing point-to-point mechanism makes such communication inefficient
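The polling-vs-blocking tradeoff in the first bullet can be illustrated with a small sketch: polling burns CPU iterations until the completion arrives, while blocking parks the thread on an event and pays a wakeup latency instead. The timing values and the spin budget below are made up for illustration.

```python
# Minimal illustration of polling vs. blocking on a completion signal.

import threading

def poll_for_completion(done_flag, budget=200_000):
    # Busy-wait: the CPU is consumed the whole time the flag is unset.
    spins = 0
    while not done_flag() and spins < budget:
        spins += 1
    return spins

def block_for_completion(event):
    # The thread sleeps; the kernel wakeup adds latency but frees the CPU.
    event.wait()

completed = [False]
event = threading.Event()

def finish():
    completed[0] = True
    event.set()

t = threading.Timer(0.01, finish)    # completion arrives after ~10 ms
t.start()
spins = poll_for_completion(lambda: completed[0])  # many wasted iterations
block_for_completion(event)          # same completion, no spinning
t.join()
```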
Conclusion
• Low latency for small messages; small-message latency has a strong impact on application behavior
• Significant improvement over UDP communication (does this still hold after recent TCP/UDP hardware offload implementations?)
• At the expense of an inconvenient API