Performance Tradeoffs for Static Allocation of Zero-Copy Buffers
Pål Halvorsen, Espen Jorde, Karl-André Skevik, Vera Goebel, and Thomas Plagemann
Institute for Informatics, University of Oslo, Norway
Multimedia and Telecommunications Track (MTT ’02) –
28th EUROMICRO Conference, Dortmund, Germany, September 2002
© 2002 Pål Halvorsen, MTT ’02, Dortmund, Germany, September 2002
Overview
Application scenario
The INSTANCE project
Zero-copy data paths: static buffer allocation, performance evaluation
Summary and conclusions
Application Scenario
Media-on-Demand server: applicable in applications like News- or Video-on-Demand provided by city-wide cable or pay-per-view companies
[Figure: multimedia storage server connected to a network]
Project goals: optimize performance within a single server:
• Reduce resource requirements
• Maximize number of clients
Retrieval is the bottleneck. Some important factors:
• Memory management
• Communication protocol processing
• Error management
The INSTANCE Project
We try to make optimal use of a given set of resources:
• network level framing
• integrated error management
• memory architecture
• periodic broadcast service
• dynamic zero-copy buffers
• static zero-copy buffers
General Operating System Structure and Data Path
[Figure: application in user space; file system and communication system in kernel space]
Example: Intel Hub Architecture (850 Chipset) – II
[Figure: data path between disk, file system, application, communication system, and network card, drawn over the hardware: Pentium 4 processor (registers, caches), memory controller hub, I/O controller hub, four RDRAM modules, and PCI slots]
Bus labels in the figure:
• system bus (64-bit, 400/533 MHz)
• hub interface (four 8-bit, 66 MHz)
• PCI bus (32-bit, 33 MHz)
• RAM interface (two 64-bit, 200 MHz)
Note: these transfers only show data movement between sub-systems. Additionally, data-touching operations within a sub-system require that data is moved from memory to the CPU, e.g.:
- checksum calculation
- encryption
- data encoding
- forward error correction
Thus, copy operations are expensive:
• bandwidth is limited
• they consume CPU cycles
• they affect the cache
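To put "bandwidth is limited" into numbers, the sketch below turns the bus labels from the architecture figure into theoretical peak rates. It assumes that each labelled clock is the effective transfer rate and that each channel moves one word per cycle; the helper `peak_mb_per_s` and the `BUSES` table are illustrative names, and sustained throughput on real hardware is lower than these peaks.

```python
# Theoretical peak bandwidth derived from the bus labels on the slide.
# Assumption: one transfer per labelled clock cycle per channel.

def peak_mb_per_s(width_bits, mhz, channels=1):
    """Peak bandwidth in MB/s (10^6 bytes per second)."""
    return channels * width_bits * mhz / 8


BUSES = {
    "system bus (64-bit, 400 MHz)": peak_mb_per_s(64, 400),
    "system bus (64-bit, 533 MHz)": peak_mb_per_s(64, 533),
    "hub interface (four 8-bit, 66 MHz)": peak_mb_per_s(8, 66, channels=4),
    "PCI bus (32-bit, 33 MHz)": peak_mb_per_s(32, 33),
    "RAM interface (two 64-bit, 200 MHz)": peak_mb_per_s(64, 200, channels=2),
}

if __name__ == "__main__":
    for name, mb in BUSES.items():
        print(f"{name}: {mb:.0f} MB/s")
```

Under these assumptions the PCI bus is the narrowest link by a wide margin, and every disk block and network packet has to cross it; a memory-to-memory copy adds one read and one write on the RAM interface on top of that.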
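Zero-Copy: Basic Idea
[Figure: file system, communication system, and application across user space and kernel space; the buf (b_data) and mbuf (m_data) data pointers reference the same data, which moves over the bus(es)]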
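The shared-pointer idea can be imitated in user space. The sketch below is only an analogy, not NetBSD code: `FakeBuf` and `FakeMbuf` are hypothetical stand-ins for the kernel's buf and mbuf headers, and a Python `memoryview` plays the role of the `b_data` and `m_data` pointers. Both headers refer to one payload, so nothing is copied between them.

```python
class FakeBuf:
    """Hypothetical stand-in for a file-system buffer header."""

    def __init__(self, data):
        self.b_data = data


class FakeMbuf:
    """Hypothetical stand-in for a network buffer header."""

    def __init__(self, data):
        self.m_data = data


def share_payload(block_size=64 * 1024, packet_size=1024):
    """Create one data region and two headers that both point into it."""
    payload = bytearray(block_size)        # the single data region
    view = memoryview(payload)             # slicing a view does not copy
    buf = FakeBuf(view)                    # whole block, as read from disk
    mbuf = FakeMbuf(view[:packet_size])    # first packet, same memory
    return buf, mbuf


if __name__ == "__main__":
    buf, mbuf = share_payload()
    buf.b_data[0] = 0x7F                   # "disk read" writes into the block
    print(mbuf.m_data[0] == 0x7F)          # the packet sees the same byte
```

A write through one header is visible through the other because both views share the same underlying `bytearray`.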
Zero-Copy: Dynamic Allocation
[Figure: file system, communication system, and application; mbuf memory pools (mbuf, mbuf cluster), buf memory pools (buf, buf cluster), and user space memory]
Zero-Copy: Static Allocation
Allocate all needed memory during stream initialization
If possible, set all buf and mbuf data pointers
Use alternating buffers
[Figure: a buffer consisting of a header and a data area; bufs and mbufs with buf pointers, mbuf pointers, and data pointers into the data area]
Zero-Copy: Operations
Stream initialization
Read operation
Send operation
Stream close
[Figure: two alternating buffers, each with a header, a data area, bufs, and mbufs; one is marked as the currently used buffer, and each has a send offset]
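The four operations and the alternating, pre-pointed buffers can be modelled in a few lines. This is a simplified user-space sketch under stated assumptions, not the NetBSD implementation: `StaticStream`, its fields, and `demo` are hypothetical names, `memoryview` slices stand in for preset data pointers, and the waiting that a real implementation needs between the reader and the sender is left out.

```python
import io


class StaticStream:
    """Hypothetical model of a stream with statically allocated buffers."""

    def __init__(self, block_size, packet_size, n_buffers=2):
        # Stream initialization: allocate every data area once and preset
        # one view (standing in for a data pointer) per packet.
        if block_size % packet_size:
            raise ValueError("block_size must be a multiple of packet_size")
        self.areas = [bytearray(block_size) for _ in range(n_buffers)]
        self.packets = [
            [memoryview(area)[off:off + packet_size]
             for off in range(0, block_size, packet_size)]
            for area in self.areas
        ]
        self.fill_index = 0   # buffer that the next read fills
        self.send_index = 0   # currently used (sending) buffer
        self.send_offset = 0  # next packet within that buffer

    def read(self, source):
        """Read operation: fill the next data area in place."""
        n = source.readinto(self.areas[self.fill_index])
        self.fill_index = (self.fill_index + 1) % len(self.areas)
        return n

    def send(self, sink):
        """Send operation: pass a preset view on; no allocation, no copy."""
        sink(self.packets[self.send_index][self.send_offset])
        self.send_offset += 1
        if self.send_offset == len(self.packets[self.send_index]):
            self.send_offset = 0
            self.send_index = (self.send_index + 1) % len(self.areas)

    def close(self):
        """Stream close: release all views, then drop the data areas."""
        for views in self.packets:
            for view in views:
                view.release()
        self.packets = []
        self.areas = []


def demo():
    data = bytes(range(256)) * 8          # 2048 bytes of test data
    stream = StaticStream(block_size=1024, packet_size=256)
    source = io.BytesIO(data)
    stream.read(source)                   # fill buffer 0
    stream.read(source)                   # fill buffer 1
    sent = []
    for _ in range(8):                    # 4 packets per buffer
        stream.send(lambda pkt: sent.append(bytes(pkt)))
    stream.close()
    return b"".join(sent)
```

All allocation happens in the constructor, so `read` and `send` only move indices. The `bytes(pkt)` copy inside `demo` exists only so the result can be inspected after the stream is closed.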
Performance: Test Setup
Implemented in NetBSD
Dell Precision Workstation 620
• PIII 1 GHz CPU
• 100 Mbps network card
• Single disk storage
Software probe to measure allocation times
• RDTSC instruction
• CPUID instruction
• probe overhead: 206 cycles
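Cycle counts from such a probe have to be corrected for the probe's own cost and then converted to time. The sketch below shows that arithmetic, assuming the time-stamp counter ticks at the nominal 1 GHz of the test machine; `net_cycles` and `cycles_to_us` are illustrative helpers, and the raw counter readings in the usage lines are made up.

```python
CPU_HZ = 1_000_000_000        # assumption: counter runs at the nominal 1 GHz
PROBE_OVERHEAD_CYCLES = 206   # probe overhead stated on the slide


def net_cycles(start, end, overhead=PROBE_OVERHEAD_CYCLES):
    """Cycles spent in the measured code, with the probe cost removed."""
    return max(0, (end - start) - overhead)


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds."""
    return cycles * 1e6 / hz


if __name__ == "__main__":
    # Made-up counter readings, 6006 cycles apart.
    cycles = net_cycles(start=1_000, end=7_006)
    print(cycles, cycles_to_us(cycles))
```

At 1 GHz the 206-cycle overhead corresponds to roughly 0.2 µs, which is why it has to be subtracted before sub-microsecond operations can be compared.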
Evaluation: Zero-Copy Transfer Rate
Throughput increase of ~2.7 times per stream (can at least double the number of clients)
Zero-copy transfer rate limited by network card and storage system
A later dynamic version:
• saturated a 1 Gbps NIC
• reduced processing time by approximately 50 %
• huge improvement in number of concurrent streams
[Figure labels: approx. 12 Mbps; approx. 6 Mbps]
Evaluation: Static Allocation
Saves time to get and free memory regions
• malloc: 5.80 µs, free: 6.48 µs
• get_poolitem: 0.15 µs, put_poolitem: 0.15 µs
• e.g., 1 GB file, 64 KB disk blocks, 1 KB packets
- retrieving 1 GB: 16 K disk I/Os (1 buf, 1 region each)
- sending 1 GB: 1 M packets (2 mbufs each, sharing data region)
- in total, 2 M + 32 K get and free operations
- 0.63 s for sending the whole file, assuming a pool (sending takes about 10 s in total, or 7 s of kernel time, with fast devices)
Might save time to set data pointers and length fields
Inflexible (variable bit rate streams)
Strict waiting on static buffers
Saves CPU cycles at the cost of statically allocating memory
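The operation count and the pool cost in the example can be reproduced with a short calculation. The sketch below uses only the figures from the slide; `operation_pairs` and `alloc_seconds` are illustrative helper names, and binary units (1 K = 1024) are assumed.

```python
KB, GB = 2**10, 2**30


def operation_pairs(file_size=GB, disk_block=64 * KB, packet=1 * KB):
    """Number of get/free pairs needed to retrieve and send one file."""
    disk_ios = file_size // disk_block   # 16 K disk I/Os
    packets = file_size // packet        # 1 M packets
    # 1 buf + 1 region per disk I/O, 2 mbufs per packet.
    return 2 * disk_ios + 2 * packets


def alloc_seconds(pairs, get_us, free_us):
    """Total time spent getting and freeing, in seconds."""
    return pairs * (get_us + free_us) / 1e6


if __name__ == "__main__":
    pairs = operation_pairs()
    print(pairs)                               # 2 M + 32 K
    print(alloc_seconds(pairs, 0.15, 0.15))    # pool items
    print(alloc_seconds(pairs, 5.80, 6.48))    # malloc/free
```

With pool items the total comes out near 0.64 s, in line with the 0.63 s quoted on the slide; this is the time static allocation avoids. The same count with malloc and free would cost roughly 26 s, a derived figure that the slide does not state.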
Conclusions and Future Work
Zero-copy reduces data movement overhead in the OS (reduces processing time by approximately 50 %)
Static versus dynamic allocation of zero-copy buffers:
• tradeoff between flexibility and CPU resources
• static saves CPU, but is inflexible
• dynamic is flexible, but adds allocation costs
• we will use our dynamic implementation in our future work
Ongoing and future work:
• Tune dynamic implementation (ongoing)
• Zero-copy network–disk path (ongoing)
• Add memory caching
Questions?