Performance Tradeoffs for Static Allocation of Zero-Copy Buffers

Pål Halvorsen, Espen Jorde, Karl-André Skevik, Vera Goebel, and Thomas Plagemann

Institute for Informatics, University of Oslo, Norway

Multimedia and Telecommunications Track (MTT ’02) – 28th EUROMICRO Conference, Dortmund, Germany, September 2002

© 2002 Pål Halvorsen, MTT ’02, Dortmund, Germany, September 2002

Overview

• Application scenario
• The INSTANCE project
• Zero-copy data paths
  - static buffer allocation
  - performance evaluation
• Summary and conclusions


Application Scenario

Media-on-Demand server: applicable in applications like News- or Video-on-Demand provided by city-wide cable or pay-per-view companies

[Figure: multimedia storage server connected to its clients through a network]

Project goals: optimize performance within a single server:
• Reduce resource requirements
• Maximize number of clients

Retrieval is the bottleneck. Some important factors:
• Memory management
• Communication protocol processing
• Error management


The INSTANCE Project

We try to make optimal use of a given set of resources:
• network level framing
• integrated error management
• memory architecture
  - periodic broadcast service
  - dynamic zero-copy buffers
  - static zero-copy buffers (the topic of this talk)

Project goals: optimize performance within a single server:
• Reduce resource requirements
• Maximize number of clients


General Operating System Structure and Data Path

[Figure: the application runs in user space; the file system and the communication system run in kernel space]


Example: Intel Hub Architecture (850 Chipset) – II

[Figure: Pentium 4 processor (registers, caches), memory controller hub with four RDRAM modules, I/O controller hub with PCI slots, network card, and disk; the file system, communication system, and application are shown as stages of the data path]

Bus characteristics:
• system bus: 64-bit, 400/533 MHz
• hub interface: four 8-bit, 66 MHz
• PCI bus: 32-bit, 33 MHz
• RAM interface: two 64-bit, 200 MHz

[Figure: data transfers between disk, file system, application, communication system, and network card]

Note: these transfers only show data movement between sub-systems. Additionally, data-touching operations within a sub-system require that data is moved from memory to the CPU, e.g.:
- checksum calculation
- encryption
- data encoding
- forward error correction

Thus, copy operations are expensive:
• bandwidth is limited
• they consume CPU cycles
• they affect the cache
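To put "bandwidth is limited" into numbers, the sketch below computes a theoretical peak rate for each bus of the Intel 850 example from its width and clock. It assumes one transfer per channel per clock tick at the quoted rate, which is an illustrative simplification and not a measured figure; the function and table names are hypothetical.

```python
# Peak bandwidth in MB/s = channels * (width in bytes) * clock in MHz.
# Assumption: one transfer per channel per tick at the quoted clock rate.
BUSES = {
    "system bus (400 MHz)": (1, 64, 400),
    "system bus (533 MHz)": (1, 64, 533),
    "hub interface": (4, 8, 66),
    "PCI bus": (1, 32, 33),
    "RAM interface": (2, 64, 200),
}


def peak_mb_per_s(channels, width_bits, clock_mhz):
    """Theoretical peak rate of a bus in MB/s (10^6 bytes per second)."""
    return channels * width_bits // 8 * clock_mhz


def peak_table():
    """Map each bus name to its theoretical peak rate in MB/s."""
    return {name: peak_mb_per_s(*spec) for name, spec in BUSES.items()}


if __name__ == "__main__":
    for name, rate in peak_table().items():
        print(f"{name:22s} {rate:5d} MB/s")
```

Under these assumptions the PCI bus is the narrowest link by a wide margin, and every additional memory-to-memory copy reads and writes the same data once more on the RAM interface.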



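Zero-Copy: Basic Idea

[Figure: buf and mbuf structures in kernel space, below the application in user space; their data pointers, b_data and m_data, refer to a shared data region, which is moved only over the bus(es)]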
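The basic idea can be illustrated outside the kernel. The following is a minimal user-space sketch, not the NetBSD implementation: the `Buf` and `Mbuf` classes are hypothetical stand-ins that only carry the `b_data` and `m_data` pointers named on the slide. In the copy path the data is duplicated twice; in the zero-copy path both pointers refer to one region.

```python
class Buf:
    """Hypothetical stand-in for a file system buffer header."""

    def __init__(self, region):
        self.b_data = region


class Mbuf:
    """Hypothetical stand-in for a communication system buffer header."""

    def __init__(self, region):
        self.m_data = region


def copy_path(block):
    """Traditional path: the data is copied at each sub-system boundary."""
    buf = Buf(bytearray(block))      # disk -> file system buffer
    user = bytes(buf.b_data)         # copy 1: kernel -> application
    mbuf = Mbuf(bytearray(user))     # copy 2: application -> communication system
    return buf, mbuf


def zero_copy_path(block):
    """Zero-copy path: b_data and m_data point to the same region."""
    view = memoryview(bytearray(block))  # the single shared data region
    return Buf(view), Mbuf(view)


if __name__ == "__main__":
    buf, mbuf = zero_copy_path(b"abcd")
    buf.b_data[0] = ord("X")         # a write through b_data ...
    print(bytes(mbuf.m_data))        # ... is visible through m_data
```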



Zero-Copy: Dynamic Allocation

[Figure: memory pools used for dynamic allocation: buf memory pools (buf, buf cluster) and mbuf memory pools (mbuf, mbuf cluster), shown alongside the application's user space memory]
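Dynamic allocation takes buffers from a pool for each operation and returns them afterwards. The sketch below models such a pool as a simple free list. The operation names `get_poolitem` and `put_poolitem` appear in the evaluation slides; the `Pool` class and its behaviour on exhaustion are assumptions made for illustration.

```python
class Pool:
    """Hypothetical fixed-size pool of equally sized memory items."""

    def __init__(self, item_size, count):
        self.item_size = item_size
        # All items are created up front; get/put only move references.
        self._free = [bytearray(item_size) for _ in range(count)]

    def get_poolitem(self):
        """Take one item from the free list."""
        if not self._free:
            raise MemoryError("pool exhausted")
        return self._free.pop()

    def put_poolitem(self, item):
        """Return an item to the free list for later reuse."""
        if len(item) != self.item_size:
            raise ValueError("item does not belong to this pool")
        self._free.append(item)

    def available(self):
        """Number of items currently on the free list."""
        return len(self._free)


if __name__ == "__main__":
    pool = Pool(item_size=1024, count=2)
    item = pool.get_poolitem()
    print(pool.available())   # one item left
    pool.put_poolitem(item)
    print(pool.available())   # both items free again
```

Each get and put is cheap, but in the dynamic scheme it is paid on every read and send operation; the static scheme pays the allocation cost once per stream.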


Zero-Copy: Static Allocation

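• Allocate all needed memory during stream initialization
• If possible, set all buf and mbuf data pointers
• Use alternating buffers

[Figure: statically allocated buffers, each with a header and a data area; the bufs and mbufs are linked through buf pointers and mbuf pointers, and their data pointers refer to the data area]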


Zero-Copy: Operations

• Stream initialization
• Read operation
• Send operation
• Stream close

[Figure: two alternating buffers, each with a header and a data area and its bufs and mbufs; one is marked as the currently used buffer, and a send offset marks the position in the data area]
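The four operations and the alternating buffers can be sketched in user space as follows. This is a simplified model under stated assumptions, not the NetBSD code: `StaticStream`, `stream_all`, and all field names are hypothetical, the "disk" is a bytes object, and a packet is a slice of the data area at a given send offset.

```python
class StaticStream:
    """Hypothetical stream with two statically allocated, alternating buffers."""

    def __init__(self, source, block_size, packet_size):
        # Stream initialization: allocate all memory once.
        self.source = source
        self.block_size = block_size
        self.packet_size = packet_size
        self.buffers = [bytearray(block_size), bytearray(block_size)]
        self.filled = [0, 0]       # valid bytes in each data area
        self.current = 0           # index of the currently used buffer
        self.file_offset = 0
        self.allocations = 2       # never grows after initialization

    def read(self):
        """Read operation: fill the currently used buffer with the next block."""
        chunk = self.source[self.file_offset:self.file_offset + self.block_size]
        self.buffers[self.current][:len(chunk)] = chunk
        self.filled[self.current] = len(chunk)
        self.file_offset += len(chunk)
        return len(chunk)

    def send(self):
        """Send operation: return packets as views into the data area."""
        view = memoryview(self.buffers[self.current])
        length = self.filled[self.current]
        packets = []
        for send_offset in range(0, length, self.packet_size):
            end = min(send_offset + self.packet_size, length)
            packets.append(view[send_offset:end])  # no copy, only a pointer
        self.current ^= 1          # switch to the other buffer
        return packets

    def close(self):
        """Stream close: release the static buffers."""
        self.buffers = []


def stream_all(source, block_size, packet_size):
    """Run a whole stream and return the bytes sent and the allocation count."""
    stream = StaticStream(source, block_size, packet_size)
    sent = bytearray()
    while stream.read() > 0:
        for packet in stream.send():
            sent += packet
    stream.close()
    return bytes(sent), stream.allocations
```

Because the buffers alternate, the packets of one block remain valid while the next block is read into the other buffer. The model omits the waiting that is needed when both buffers are still in use.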


Performance: Test Setup

Implemented in NetBSD

Dell Precision Workstation 620:
• PIII 1 GHz CPU
• 100 Mbps network card
• Single disk storage

Software probe to measure allocation times:
• RDTSC instruction
• CPUID instruction
• probe overhead 206 cycles
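A cycle count from the probe becomes a time once the probe overhead is subtracted and the clock rate is applied. The sketch below shows that arithmetic, assuming the CPU runs at exactly its nominal 1 GHz; the helper names and the example timestamps are hypothetical.

```python
CPU_HZ = 1_000_000_000          # assumption: nominal 1 GHz clock of the PIII
PROBE_OVERHEAD_CYCLES = 206     # probe overhead reported on the slide


def net_cycles(start_tsc, end_tsc, overhead=PROBE_OVERHEAD_CYCLES):
    """Cycles spent in the measured code, excluding the probe itself."""
    return max(0, end_tsc - start_tsc - overhead)


def cycles_to_us(cycles, cpu_hz=CPU_HZ):
    """Convert a cycle count to microseconds."""
    return cycles * 1e6 / cpu_hz


if __name__ == "__main__":
    # Hypothetical timestamp counter readings taken around one call.
    cycles = net_cycles(start_tsc=1_000, end_tsc=7_006)
    print(cycles, "cycles =", cycles_to_us(cycles), "microseconds")
```

At 1 GHz one cycle is one nanosecond, so the probe overhead of 206 cycles corresponds to about 0.2 µs.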


Evaluation: Zero-Copy Transfer Rate

Throughput increase of ~2.7 times per stream (can at least double the number of clients)

Zero-copy transfer rate limited by network card and storage system

A later dynamic version:
• saturated a 1 Gbps NIC
• reduced processing time by approximately 50 %
• huge improvement in number of concurrent streams

[Figure: measured transfer rates, annotated with approx. 12 Mbps and approx. 6 Mbps]


Evaluation: Static Allocation

Saves time to get and free memory regions:
• malloc – 5.80 µs, free – 6.48 µs
• get_poolitem – 0.15 µs, put_poolitem – 0.15 µs

E.g., 1 GB file, 64 KB disk blocks, 1 KB packets:
• retrieving 1 GB: 16 K disk I/Os (1 buf, 1 region each)
• sending 1 GB: 1 M packets (2 mbufs each, sharing data region)
• in total 2 M + 32 K get and free operations
• 0.63 s for sending the whole file, assuming a pool (sending takes about 10 s in total, or 7 s kernel time, with fast devices)

Might save time to set data pointers and length fields

Inflexible (variable bit rate streams)

Strict waiting on static buffers

Saves CPU cycles at the cost of statically allocating memory
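The operation count and the time estimate for the 1 GB example can be reproduced with a few lines of arithmetic. The sketch below uses the sizes and per-operation costs from the slide; the function names are hypothetical, and the item counting (two items per disk I/O, two mbufs per packet) follows the slide's own breakdown.

```python
KB, GB = 2**10, 2**30


def operation_count(file_size=GB, block_size=64 * KB, packet_size=KB):
    """Number of items that must be obtained (and later freed) for one file."""
    disk_ios = file_size // block_size     # 16 K disk I/Os
    packets = file_size // packet_size     # 1 M packets
    disk_items = disk_ios * 2              # 1 buf + 1 data region each
    net_items = packets * 2                # 2 mbufs each, data region shared
    return disk_items + net_items


def alloc_seconds(items, get_us, free_us):
    """Total time in seconds for one get and one free per item."""
    return items * (get_us + free_us) / 1e6


if __name__ == "__main__":
    items = operation_count()
    print(items, "items")                            # 2 M + 32 K
    print(alloc_seconds(items, 0.15, 0.15), "s")     # pool costs
```

With the pool costs of 0.15 µs per operation this yields about 0.64 s, which the slide quotes as 0.63 s. The same function can be applied to the malloc and free costs for comparison.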


Conclusions and Future Work

Zero-copy reduces data movement overhead in the OS (reduces processing time by approximately 50 %)

Static versus dynamic allocation of zero-copy buffers:
• tradeoff between flexibility and CPU resources
• static saves CPU, but is inflexible
• dynamic is flexible, but adds allocation costs
• we will use our dynamic implementation in our future work

Ongoing and future work:
• Tune dynamic implementation (ongoing)
• Zero-copy network–disk path (ongoing)
• Add memory caching


Questions??
